Don't frame packets for dead connections #2908

Closed

Conversation

@astei (Contributor, Author) commented Jul 15, 2020

Prior fixes such as #2902, #2901, #2876, and #2719 (and probably others) fail to actually fix this issue (known as "nullping" or "bungeesmasher" by users who don't know any better).

This PR contains an actual fix (which is fundamentally very simple). I will aim to explain the problem in detail and offer some thoughts on the matter. This is all stream-of-consciousness, but my motivation for pushing this up now is simply to get it off my mind.

The problem

The problem is, fundamentally, that the proxy will continue to decode packets even after the channel that sent them has disconnected. Assuming Netty uses an initial receive-buffer size of 1,024 bytes, an attacker can batch 512 properly framed but invalid packets into a single read (one byte for the length, one byte for the packet ID).

This can be demonstrated simply by sending 512 copies of 0x01 0x00 to the proxy over multiple connections. This will produce around 512 exceptions. Creating an exception is fairly expensive due to stack-trace generation (if you believe Aleksey Shipilev).
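
For illustration, the batched payload described above can be built like this (a purely illustrative sketch; the helper name is made up):

// Illustrative only: builds the 1,024-byte batch described above.
static byte[] buildBatchedFrames()
{
    byte[] payload = new byte[ 1024 ];
    for ( int i = 0; i < 512; i++ )
    {
        payload[ 2 * i ] = 0x01;     // VarInt length prefix: one byte of payload follows
        payload[ 2 * i + 1 ] = 0x00; // packet ID 0x00 with no body: well framed, but invalid to decode
    }
    return payload;
}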

The solution

See the code: we add a check to Varint21FrameDecoder that silently discards any further packets once the connection has been closed.
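
Roughly, the check has the following shape (an illustrative sketch only; see the actual diff for the real change):

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;
import java.util.List;

public class Varint21FrameDecoder extends ByteToMessageDecoder
{
    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) throws Exception
    {
        // If the connection has already been closed, consume and discard whatever is
        // left in the cumulation buffer instead of framing packets for a dead connection.
        if ( !ctx.channel().isActive() )
        {
            in.skipBytes( in.readableBytes() );
            return;
        }

        // ... normal VarInt21 length-prefixed frame decoding continues here ...
    }
}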

A bit of commentary

This community has a severely lackadaisical approach to security problems: people discover them and then lock both the PoC methods and the fixes behind paid resources (with some of these actors being quite insidious). Even discovering this fix took me months of painstakingly extracting information from random people in the community. I have since devised this proven fix, and it is being used in Velocity and Paper with great success. I would urge that this get pulled ASAP.

I've brought this up to Mojang directly for them to solve, and hope a fix comes in 1.16.2 or 1.17.

@Janmm14 (Contributor) commented Jul 15, 2020

I checked the ByteToMessageDecoder source code and can confirm that the intended behaviour of ByteToMessageDecoder is to drain the remaining buffer through decode when the channel gets closed or goes inactive.

That is not what we want, though, in all the cases where we close the channel ourselves, so the fix seems to be in the right place as well.

My refactoring PR did contain a similar check in the past (I got it from some custom BungeeCord forks I found and decompiled here and there), but your check is better.
I removed it from my PR because I thought the Netty fix regarding exception handling was the only problem.
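
(For illustration, this drain-on-close behaviour can be reproduced with a minimal, hypothetical EmbeddedChannel test along these lines, assuming Netty 4.1; the decoder below is a made-up stand-in, not BungeeCord's:)

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.embedded.EmbeddedChannel;
import io.netty.handler.codec.ByteToMessageDecoder;
import java.util.List;

public class DrainOnCloseDemo
{
    public static void main(String[] args)
    {
        EmbeddedChannel ch = new EmbeddedChannel( new ByteToMessageDecoder()
        {
            @Override
            protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out)
            {
                System.out.println( "decode() called with " + in.readableBytes()
                        + " byte(s), channel active = " + ctx.channel().isActive() );
                if ( in.readableBytes() >= 2 )
                {
                    in.skipBytes( 2 ); // consume one complete two-byte "frame"
                }
            }
        } );

        // Two complete frames plus one dangling byte; the dangling byte stays in the
        // cumulation buffer and is pushed through decode() again when the channel closes.
        ch.writeInbound( Unpooled.wrappedBuffer( new byte[] { 1, 0, 1, 0, 1 } ) );
        ch.close(); // expect further decode() calls with channel active = false
    }
}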

@Janmm14 (Contributor) commented Jul 15, 2020

One thing though: Please add the same check to the LegacyDecoder

@astei marked this pull request as draft Jul 15, 2020
@astei (Contributor, Author) commented Jul 15, 2020

Looks like the fix is incomplete. Investigating further.

@astei (Contributor, Author) commented Jul 16, 2020

I've tested this fix; it is now complete.

System used:

  • AMD Ryzen 9 3900X
  • Fedora 32
  • 4GB of heap allocated

Before these changes, the proxy would use north of 550% CPU. After these changes, it drops to more like ~290%. Further improvements would include reducing the amount of log spam (since the logger thread gets very busy). These results would likely be better if BungeeCord switched to Log4j.

@astei marked this pull request as ready for review Jul 16, 2020
astei added a commit to astei/Waterfall that referenced this pull request Jul 16, 2020
electronicboy pushed a commit to PaperMC/Waterfall that referenced this pull request Jul 16, 2020
@md-5 (Member) commented Jul 16, 2020

I still think this needs input from netty developers.

It would shock me to find out that the best practice for writing a Netty server is that every handler in the pipeline must have an if (!active) return at the top of it.

// will fire any cumulated data through the pipeline, so we want to try and stop it here.
if ( !ctx.channel().isActive() )
{
    return;
}
@md-5 (Member) commented Jul 16, 2020

Why do you skip bytes elsewhere but not here?

@astei (Contributor, Author) commented Jul 16, 2020

MessageToMessageDecoder doesn't cumulate the buffers like the ByteToMessageDecoder does. We intentionally consume the contents of the message in ByteToMessageDecoder as otherwise it would simply save the messages for later, wasting memory for a connection that we know for certain is dead.

@astei (Contributor, Author) commented Jul 16, 2020

I believe you're right, @md-5. This is an ugly fix to the issue and it's something the Netty developers should take a look at. Do you think it's valuable for me to file an issue?

@astei (Contributor, Author) commented Jul 16, 2020

In the meantime, I think we should go ahead and merge this in, even if it's just in the interim until a future Netty release fixes this (if they decide to fix it).

@md-5 (Member) commented Jul 16, 2020

Yes, I think you should get their feedback on it. Your code may fix or improve some issue, but right now I suspect it's just as much of a band-aid as what other people have proposed before (e.g. let's just remove exceptions and replace them with return).

In the general case I can see why you might want to process data for a connection that has already been closed, so I don't think that behaviour is in itself a Netty bug; but if you don't want to, then having to explicitly code each handler that way (as you have done) seems wrong. Having identified the issue, I think they would be able to give better, or at least authoritative, advice on how to fix it.

What happens if you just clear the pipeline on close instead?

@astei (Contributor, Author) commented Jul 16, 2020

Clearing the pipeline explicitly (and adding a handler which blackholes reads) was not any worse than the fix I proposed in this PR. If you think this is the clearest fix, I can commit that instead.

It looks like a similar feature was suggested for Netty in netty/netty#3410, just never acted on - perhaps it's time to bring that back to life?
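
(For illustration, a read-blackholing handler of that kind could look roughly like this hypothetical sketch; the class name is made up:)

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.ReferenceCountUtil;

// Installed after clearing the pipeline of a closed connection; everything that
// still arrives is released and dropped instead of being decoded.
public class DiscardingReadHandler extends ChannelInboundHandlerAdapter
{
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg)
    {
        ReferenceCountUtil.release( msg ); // release pooled buffers so discarded reads don't leak
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause)
    {
        // swallow exceptions for a connection we already consider dead
    }
}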

@astei (Contributor, Author) commented Jul 17, 2020

I have developed a Netty change which makes the relevant changes to ByteToMessageDecoder directly. I will test this and if it also solves the issue I will PR it.

@astei (Contributor, Author) commented Jul 17, 2020

I tested it (which was painful) and it didn't work. I'll have to pick this up tomorrow as it's late.

@astei (Contributor, Author) commented Jul 17, 2020

I decided to give this another go; pulling the isActive() checks up into ByteToMessageDecoder directly works the same as this PR.

However, I'm sure this isn't really the right approach. I'll want to seek some counsel from the Netty developers to be sure I'm going in the right direction. That can wait for the morning.

MrIvanPlays added a commit to MrIvanPlays/IvanCord that referenced this pull request Jul 17, 2020
Upstream has released updates that appears to apply and compile correctly.
This update has not been tested by MrIvanPlays and as with ANY update, please do your own testing

Waterfall Changes:
6f55959 Apply SpigotMC/BungeeCord#2908 (#546)
mikroskeem added a commit to mikroskeemsrealm/Firefly that referenced this pull request Jul 17, 2020
Upstream has released updates that appears to apply and compile correctly.
This update has not been tested by Firefly and as with ANY update, please do your own testing

Waterfall Changes:
6f55959 Apply SpigotMC/BungeeCord#2908 (#546)
md-5 pushed a commit that referenced this pull request Jul 18, 2020
@md-5 (Member) commented Jul 18, 2020

I've merged this as it's currently the best attempt at fixing the issue; however, I still expect to see further work done on it, as discussed above.

@md-5 closed this Jul 18, 2020
kennytv added a commit to LuminuNET/LumiWaterfall that referenced this pull request Jul 21, 2020
Upstream has released updates that appears to apply and compile correctly.
This update has not been tested by PaperMC and as with ANY update, please do your own testing

Waterfall Changes:
a619858 Updated Upstream (BungeeCord) (#549)
6f55959 Apply SpigotMC/BungeeCord#2908 (#546)
7ddd12d Fix compile and move version changes out of the wrong patch
7737b0f Updated Upstream (BungeeCord)
4cd9150 Updated Upstream (BungeeCord)
132fd5d Updated Upstream (BungeeCord)
f6b6c61 Updated Upstream (BungeeCord)
8fb8b6c Updated Upstream (BungeeCord)
d2dc905 Updated Upstream and Fix switching servers with disable metadata rewrite (#523)
7ec32c4 Temp 1.16.1 proto support
3781053 Fix POM
7bb1088 Updated Upstream (BungeeCord)
360fdcf Updated Upstream (BungeeCord)
61508bf fix derp in protocol mapping again
52291f0 Fix Entity effect packet mappings
d61b93a Updated Upstream (BungeeCord)
b3296ad Disable metadata rewriting for 1.16 (#520)
29b6e01 Preliminary 1.16 support (#494)
3a8916d Remove version from brand info (Fixes #517)
33c73f9 Use vanilla maximum for tab completion packets (Fixes #512)
717477a Updated Upstream (BungeeCord)
26f574b Cache some more common exceptions
@linsaftw commented Jul 23, 2020

Hi, sadly this didn't patch the issue either.

(screenshot of the server list attached)

It's the first server in the list; as you can see, it is unpingable. I tried my own "patch" and this patch, which looked promising. I hope we can find a fix for this soon.

By the way, my gf said that the connections only send a "localhost" address in the status packet (MC|Ping or something like that; she diagnosed this in Node.js), and that probably has something to do with it. But I think the real issue is that the connection never sends the ping packet, which causes Bungee to wait until the connection sends one.

EDIT: I think it only sends the status packet; Bungee then waits for a ping that is never sent and times out after 30 seconds, which allows "empty" connections to stay open and overload BungeeCord.

Sorry if I said something wrong; I am not experienced at this and I just want a patch.

electronicboy pushed a commit to PaperMC/Travertine that referenced this pull request Jul 24, 2020
Upstream has released updates that appears to apply and compile correctly.
This update has not been tested by PaperMC and as with ANY update, please do your own testing

Waterfall Changes:
a619858 Updated Upstream (BungeeCord) (#549)
6f55959 Apply SpigotMC/BungeeCord#2908 (#546)
@MrIvanPlays (Contributor) commented Jul 27, 2020

EDIT: I think it only sends the status packet; Bungee then waits for a ping that is never sent and times out after 30 seconds, which allows "empty" connections to stay open and overload BungeeCord.

Then it looks like the timeout doesn't fully close the connection?

@linsaftw commented Jul 27, 2020

Then it looks like the timeout doesn't fully close the connection?

Yes, they do, but they only close after 30 seconds, which means that by then we already have 300,000 connections established in a 10,000-connections-per-second attack. Damn Minecraft.

@MrIvanPlays (Contributor) commented Jul 27, 2020

Yes, they do, but they only close after 30 seconds, which means that by then we already have 300,000 connections established in a 10,000-connections-per-second attack. Damn Minecraft.

You can actually make this delay smaller in the Bungee config. We should probably make it smaller by default, too.

@linsaftw commented Jul 28, 2020

You can actually make this delay smaller in the Bungee config. We should probably make it smaller by default, too.

Yeah, but players with high ms will get kicked; plus, you would need to reduce it to a really low value to keep Netty from crashing.

@astei (Contributor, Author) commented Jul 29, 2020

@linsaftw Or you can strictly validate ping requests to the proxy - which is what I did just a few days ago.

I am not going to be working on BungeeCord any longer; I moved on to my own project (Velocity) and have no desire to touch BungeeCord any more. I provided this PR as a public service, and now my job is done.

@md-5 (Member) commented Jul 29, 2020

I am not going to be working on BungeeCord any longer; I moved on to my own project (Velocity) and have no desire to touch BungeeCord any more.

Sigh, and this is why I have so much reluctance about pulling band-aid fixes. Whilst I appreciate your contribution, this is as clear an example as ever of why pulling such fixes is a bad idea: despite promises or requests to the contrary, they never actually get improved, and we end up stuck with a messy/incomplete solution rather than the issue remaining open and a proper solution being created.

I provided this PR as a public service and now my job is done.

And this is the second problem. Whilst perhaps not an issue in this case, actually standing by your contributions is an important part of the open source community. Contributing is not supposed to be an exercise where you drop by, make the code the maintainer's problem because it's convenient for you, and then disappear forever (inb4 "something is better than nothing"). To take @Mystiflow and the chat API as an example: I have little problem pulling relatively large contributions relatively quickly because I know that they will keep on top of their code by assisting with updates and maintenance rather than just unilaterally adding to the maintenance burden.

Again, I am not trying to make an example of this specific PR or even this specific remark, but I do wish to make the point clear to everyone that open source contributions are a two-way street: yes, they are appreciated, but they still come with responsibilities, and the more you commit to those responsibilities, either in fact or by your attitude, the better it is for everyone.

@astei (Contributor, Author) commented Jul 29, 2020

Whilst I appreciate your contribution, this is as clear an example as ever of why pulling such fixes is a bad idea: despite promises or requests to the contrary, they never actually get improved, and we end up stuck with a messy/incomplete solution rather than the issue remaining open and a proper solution being created.

Upstream has thus far shown no enthusiasm. The chances of them creating a better solution seem slim.

I do have some work I've been doing on ByteToMessageDecoder, and I intend to upstream it to make a more efficient (and proper) solution, but I have simply lacked the time to do so. Even then, the fact that the issue has sat silent with no activity from the Netty team suggests they simply don't view it as a priority.

actually standing by your contributions is an important part of the open source community

Perhaps I've been a bit standoffish with my earlier statement. I do stand by this contribution, as it is the best available solution (planned Netty PR notwithstanding). I do not proclaim it to be the ideal solution and never have. My statement refers quite simply to the fact that I am not strongly invested in the development of BungeeCord (and haven't been since at least 2017).

For the record, this specific issue affects my project(s) as well, so I have a rather strong incentive to upstream a "proper" fix into Netty itself. But as it presently stands, it has simply not been a priority for me.

@xxDark commented Jul 29, 2020

We have already discussed this in the #waterfall-dev channel. I've made some changes to PipelineUtils so that connections are not kept open for longer than necessary.
See xxDark@3cfc8c0
Although the current solution may reduce the CPU load, ByteToMessageDecoder#callDecode is still called, and it's not perfect.
See https://github.com/xxDark/BungeeCord/blob/3cfc8c0ea94c2d2b6773e898bac1889fd262b16a/shared/src/main/java/net/md_5/bungee/util/ChannelDiscardHandler.java
@linsaftw can you test my branch?

@linsaftw commented Jul 29, 2020

@linsaftw can you test my branch?

You basically forced the timeout down to 3000 ms. I think it will work, but damn, it's not the best solution. I hope there is something better.

@xxDark commented Jul 29, 2020

You basically forced the timeout down to 3000 ms. I think it will work, but damn, it's not the best solution. I hope there is something better.

Yes, but I restore it later, once the player has fully logged in.
EDIT: xxDark@3cfc8c0#diff-d9d3284e91fdccd1cd177c5895ea89c8R507

@xxDark commented Jul 29, 2020

I don't think the login process takes longer than 3 seconds under normal conditions.

@MrIvanPlays (Contributor) commented Jul 29, 2020

I don't think the login process takes longer than 3 seconds under normal conditions.

Maybe this should be configurable (for cases where the server is in, say, Europe but most of the player base is in, say, Asia).

@xxDark commented Jul 29, 2020

Maybe this should be configurable (for cases where the server is in, say, Europe but most of the player base is in, say, Asia).

Yes, that can be made configurable, of course. My branch just contains fixes that I previously used in my private fork.
I'm not planning on adding any other features to it; I only maintain the solution I published.

@FurmigaHumana commented Jul 29, 2020

This misses the point a little bit. First of all, there isn't much that can be done; all we can do is make handling new connections as cheap as possible so it takes longer to go down, but if the attacker can overpower your hosting provider it will still crash nonetheless.

While under attack there will be tons of connections; there is no way to prevent that. The issue is that BungeeCord still tries for way too long to handle each connection, and when it finally fails it generates expensive exceptions and logs. #2876 is the only change that actually made handling new connections a little cheaper, reducing logging, memory copies, and exceptions.

In our network the only solution was to remove all logging to preserve storage ops, silently close connections without generating any garbage like exceptions, forcefully close sockets as soon as an invalid byte has been read (and ban the IP for a few moments), and avoid making any copies or anything remotely expensive before the proxy is sure it is indeed a valid connection. In this situation every bit counts, and we haven't been taken down by these attacks in a while. We use a private branch of BungeeCord that no longer follows upstream, but unless some harsh measures are taken this issue (if it can even be considered an issue) won't be fixed.

@xxDark commented Jul 29, 2020

I've made two simple changes:
xxDark@26a9c3c#diff-24ac580564f5cf98d6248bab1eda55fdR29
xxDark@75881ed#diff-24ac580564f5cf98d6248bab1eda55fdR30
Plus (just like before) I added an outbound handler that stops all incoming messages and discards writes/exceptions.

4 reads in total vs. 512.
The number of reads was measured by simply incrementing a counter on each decode call and printing it in handlerRemoved0.
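
(For illustration, counting decode calls in that way could look roughly like this hypothetical sketch, not the actual code in the branch above:)

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;
import java.util.List;

public class CountingFrameDecoder extends ByteToMessageDecoder
{
    private int decodeCalls;

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out)
    {
        decodeCalls++;
        // ... actual frame decoding would go here ...
    }

    @Override
    protected void handlerRemoved0(ChannelHandlerContext ctx)
    {
        // runs when the handler leaves the pipeline, e.g. when the channel closes
        System.out.println( "decode() was called " + decodeCalls + " times" );
    }
}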

@linsaftw commented Jul 30, 2020

We did that on our FlameCord fork, and sadly it's not enough. The solution at the moment is to reduce the timeout to 5000 (or something like that) in config.yml. Sadly.

Paulomart pushed a commit to MadGamble/BungeeCord that referenced this pull request Sep 18, 2020