Skip to content

Try fixing memory leak for syslog input#24525

Closed
fpetersen-gl wants to merge 11 commits intomasterfrom
syslog_memleak
Closed

Try fixing memory leak for syslog input#24525
fpetersen-gl wants to merge 11 commits intomasterfrom
syslog_memleak

Conversation

@fpetersen-gl
Copy link
Contributor

@fpetersen-gl fpetersen-gl commented Dec 11, 2025

Usually the netty-component calling a retain() on a ByteBuf is also responsible for release()ing it. Until now, there is a retain() in the SyslogTCPFramingRouterHandler, but we only release() it in case of an exception. This is on purpose, as when it comes to multi-frame messages, the first part(s) of the message were already released and thus not accessible, when the decoder is trying to do its job.

Now, we keep track of all messages we retain() and release them once the channel (usually network-connection) is closed or an exception occurs. That way we can be sure to release all allocated buffers, while keeping the buffers around for long enough to also be able to decode them correctly.

Things to verify:

  • Is the SyslogTCPFramingRouterHandler a singleton (possible threading-issues on the collection of retainedBuffers)?
    No, it isn't. Netty is using it as part of its event-loop, guaranteeing that there's no other thread accessing it at the same time. Also, it doesn't contain shared state (across multiple channels), so it should be ok to keep track of the retained messages this way.

Description

Motivation and Context

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have requested a documentation update.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

@fpetersen-gl fpetersen-gl requested a review from Copilot December 11, 2025 14:56
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR attempts to fix a memory leak in the syslog TCP input by improving the management of retained ByteBuf references. The handler now tracks all retained buffers and ensures they are released when the channel closes or an exception occurs, preventing accumulated memory leaks from multi-frame messages.

Key changes:

  • Added tracking of retained buffers in a collection for proper lifecycle management
  • Implemented cleanup logic in channelInactive() and exceptionCaught() methods
  • Added test coverage for the problematic syslog message scenario

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

File Description
SyslogTCPFramingRouterHandler.java Adds buffer tracking collection and cleanup methods to ensure all retained ByteBuf instances are released
SyslogTCPFramingRouterHandlerTest.java Adds test case for problematic syslog message and debug logging to existing tests
not_working_syslog_msg_1.txt Adds test resource file containing a real-world syslog message that triggers the memory leak scenario
Comments suppressed due to low confidence (1)

graylog2-server/src/test/java/org/graylog2/inputs/syslog/tcp/SyslogTCPFramingRouterHandlerTest.java:1

  • The test resource file contains a timestamp in the future (2025-10-31). Test data should use dates in the past to avoid confusion and potential time-based test failures.
/*

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}

@Override
public void channelInactive(ChannelHandlerContext ctx) throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am bit worried about this part, would memory here potentially grow unlimited if the channel is high active?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understood, in netty terms a "channel" is basically a single network-connection to a port. So once a single transmission of a message is finished, the channel becomes inactive. OTOH, I'm not sure if remote syslog-logger are keeping the connection open and pumping a gazillion messages in one go.

From a different perspective: The retained ByteBuf's exist in out direct (off-heap) memory already. This is just adding the collection, which keeps track of the ByteBuf-objects, so compared to them, it should be rather cheap memory-wise.

@boosty boosty linked an issue Dec 12, 2025 that may be closed by this pull request
2 tasks
@fpetersen-gl fpetersen-gl marked this pull request as ready for review December 12, 2025 13:51
@fpetersen-gl
Copy link
Contributor Author

One challenge that popped up:
The ByteBufs are kept as long as no exception occurs and the TCP-connection is open. This can be a long time. How can the buffers be cleaned up in the meantime?

@vvasylenko
Copy link

vvasylenko commented Jan 22, 2026

Tested. I have created a script to send Syslog TCP messages, sent 100K messages to Graylog single node and to Graylog with two nodes. I haven't seen any data loss. Also no memory leaks.
Here is the script:

#!/bin/sh

HOST="localhost"
PORT="5140"
COUNT="100000"
DELAY="0.02"

APPNAME="testapp"
PROCID="sh"
MSGID="TEST"
RUN_ID="$(date +%Y%m%d-%H%M%S)"
LOCAL_HOST="$(hostname | cut -d. -f1)"

i=1
while [ "$i" -le "$COUNT" ]; do
  TS="$(date -u '+%Y-%m-%dT%H:%M:%SZ')"

  # RFC5424: <PRI>VERSION TIMESTAMP HOST APP PROCID MSGID STRUCTURED-DATA MSG
  SYSLOG_MSG="<13>1 ${TS} ${LOCAL_HOST} ${APPNAME} ${PROCID} ${MSGID} - run=${RUN_ID} seq=${i}"

  # RFC6587 octet-counting framing for TCP: "<len> <syslog_msg>"
  LEN="$(printf "%s" "$SYSLOG_MSG" | LC_ALL=C wc -c | tr -d ' ')"

  # Send framed syslog message via nc, then close the connection
  printf "%s %s" "$LEN" "$SYSLOG_MSG" | nc "$HOST" "$PORT"

  sleep "$DELAY"
  i=$((i + 1))
done

On request here is also some graphics:
Graylog 1 node
Screenshot 2026-01-23 at 10 33 22

Graylog 2 node
Screenshot 2026-01-23 at 10 33 40

@patrickmann patrickmann self-requested a review February 25, 2026 08:44
@patrickmann
Copy link
Contributor

Claude suggests these improvements to the unit tests:

  1. Test uses maxCapacity() - 1 instead of readableBytes()
    In testNotWorkingSyslogMessage(), lines 145-146:
    final String actual = result.readCharSequence(result.maxCapacity() - 1, StandardCharsets.US_ASCII).toString();
    This reads one fewer byte than the actual frame. The expected string computed with bytes.length - 2 also drops an extra byte. Both sides drop the same character, so the test passes for the wrong reason and fails to verify the last byte.

Fix: Use result.readableBytes() for actual, and bytes.length - 1 for expected (only stripping the \n
delimiter).

  1. Leaked EmbeddedChannel from setUp()
    In testNotWorkingSyslogMessage(), line 138:
    channel = new EmbeddedChannel(new SyslogTCPFramingRouterHandler(380000, Delimiters.lineDelimiter()));
    This overwrites the channel field without finishing the one created by setUp(). The old channel's resources leak.

Fix: Call channel.finishAndReleaseAll() before reassigning, or use a local variable.

@patrickmann
Copy link
Contributor

More interestingly, Claude also suggests an improvement to address long-lived connections:

Architectural: retainedBuffers tracking is unnecessary — use super(false) instead

The retainedBuffers + removeIf approach fights against SimpleChannelInboundHandler's auto-release. The removeIf(buf -> buf.refCnt() == 0) runs inside channelRead0() before the auto-release fires, so it almost never prunes anything.
On long-lived connections, buffers accumulate in the list. The cleanest fix is super(false) to disable auto-release entirely — ByteToMessageDecoder already manages buffer lifecycle through its cumulator. This eliminates the retainedBuffers list, removeIf, cleanup(), and exceptionCaught override. Additionally, forwarding channelInactive to the sub-handler fixes a pre-existing cumulation buffer leak.

@patrickmann
Copy link
Contributor

The memory leak only triggers when a TCP connection closes with incomplete frame data sitting in the ByteToMessageDecoder's cumulation buffer. Complete messages are fully consumed and released — so sending well-formed syslog messages will never reproduce the leak.

Here's a test script: syslog_leak_test.sh
This script reliably triggered netty leak detection after at most 3 runs on master (launch Graylog with
-Dio.netty.leakDetection.level=paranoid).

@fpetersen-gl
Copy link
Contributor Author

Closing in favor of #25120

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

syslog-message doesn't show up in search and seems to cause memory-leak

5 participants