Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in eventloop #36

Closed

Conversation

romanalexander
Copy link

I have geyser running inside a k8s pod (containerd runtime). There's been a persistent bug where clients can't connect half the time in a weird pattern like

Connect Attempt #1 works.
Connect Attempt #2 fails.
Connect Attempt #3 works.
... repeat ...

I isolated it down to RakServerOnlineInitialHandler.channelRead0 channel.eventLoop() responsible for RakChildChannel.setActive is somehow happening too slowly. When this happens, datagrams are read while the channel is still inactive. This causes any UpstreamSession.sendPacketImmediately to silently(!!) fail writing the packet out. This behavior is caused solely by RequestNetworkSettingsPacket trying to immediately flush NetworkSettingsPacket that will never actually be sent to the underlying socket.

I think the best solution is to hold all EncapsulatedPackets in a buffer until the channel becomes active. (Also I have no clue why eventLoop is running like this. Maybe the guest OS being stuck with low resolution timers.)

You can create a reproducible test case with:

                channel.eventLoop().execute(() -> {
                    Thread.sleep(5000);
                    channel.setActive(true); // make sure this is set on same thread
                    channel.pipeline().fireChannelActive();
                });

I attached this as a PR with example solution code that fixes the problem for me. Obviously this isn't an ideal solution. Hopefully someone more well versed in netty can write a better solution.

Additional observations:

  • JRE graalvm-ce-java17-linux-amd64-22.2.0
  • AMD Ryzen 7 5800X 8-Core Processor
  • Host Ubuntu Jellyfish, guest Ubuntu Jammy
  • Tested the problem with TransportMethod.EPOLL and TransportMethod.NIO with the same situation.
  • This occurs only inside a containerd pod. Running the same hardware, same NIC, same jar, same JVM in the host OS works without issue.
  • This has been an ongoing issue for a few months.

@Alemiz112
Copy link
Member

Hi, thanks for opening the issue. So what I understand the actual issue is that sendPacketImmediately() sends bedrock packet before the channel is active?

@Alemiz112
Copy link
Member

Have you tried enabling trace logging? If so, do you see received message from inactive channel or similar messages when this happens?

@romanalexander
Copy link
Author

romanalexander commented Nov 22, 2023

Hi, thanks for opening the issue. So what I understand the actual issue is that sendPacketImmediately() sends bedrock packet before the channel is active?

Yes this is accurate. Due to how Geyser works if you delay sending the packet until after the channelActive it'll crash the client because compression is turned on at the same time that packet is sent. Ideally the incoming datagrams are buffered until channelActive or we modify the API for ChannelFuture access and enable compression only after that packet is placed in the outbound buffer.

Have you tried enabling trace logging? If so, do you see received message from inactive channel or similar messages when this happens?

I tried trace logging in org.cloudburst io.netty org.geyser and the only traces look like this (slightly different due to my own debugging lines)

[16:22:03 INFO] [geyser]: /*.*.*.*:61631 tried to connect! In PacketId 5. notBanned? true
[16:22:04 INFO] [geyser]: /*.*.*.*:61631 tried to connect! In PacketId 5. notBanned? true
[16:22:04 TRACE] [org.cloudburstmc.protocol.bedrock.BedrockSession]: Inbound /*.*.*.*:616310: RequestNetworkSettingsPacket(protocolVersion=622)
[16:22:04 INFO] [geyser]: sendPacketImmediately: isClosed? false Sending NetworkSettingsPacket packet to /*.*.*.*:61631
[16:22:04 TRACE] [org.cloudburstmc.protocol.bedrock.BedrockSession]: Outbound /*.*.*.*:616310: NetworkSettingsPacket(compressionThreshold=512, compressionAlgorithm=ZLIB, clientThrottleEnabled=false, clientThrottleThreshold=0, clientThrottleScalar=0.0)

Nothing relevant following until you press CANCEL in minecraft client.

(edit: redact PII)

@Alemiz112
Copy link
Member

Hi @romanalexander , can you please test if you are able to reproduce this issue with fix/early-messages branch?

@romanalexander
Copy link
Author

Hi @romanalexander , can you please test if you are able to reproduce this issue with fix/early-messages branch?

👍 fix/early-messages seems to work great, thanks!

onebeastchris added a commit to onebeastchris/Geyser that referenced this pull request Dec 20, 2023
bump netty due CVE-2022-41881

Changes in cloudburst libs:
- uses netty 4.1.101.Final
- bumps netty-transport-raknet which fixes CloudburstMC/Network#36, an issue where some connection attempts did not work
onebeastchris added a commit to GeyserMC/Geyser that referenced this pull request Dec 20, 2023
* bump cloudburst's netty-transport-raknet

* bump cloudburst protocol/codec/connection bump
bump netty due CVE-2022-41881

Changes in cloudburst libs:
- uses netty 4.1.101.Final
- bumps netty-transport-raknet which fixes CloudburstMC/Network#36, an issue where some connection attempts did not work
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants