
Log lines merged to single message #107

Open
chris-vest opened this issue Jan 14, 2019 · 8 comments

Comments

@chris-vest

On certain occasions the logs coming from our services are not parsed correctly. As a consequence, thousands of log lines are crammed together into one big message sent to Sumo Logic. Because the message is so large, it exceeds Sumo Logic's limit for an individual message and the content is split across multiple messages. As a result, it loses its JSON formatting and becomes "invisible" to many queries that assume the log entries are structured JSON.

This seems to be occurring only with Kafka Connect; when certain operations are performed it logs very noisily.

I have noticed that sumologic-fluentd is being OOM killed. I'm wondering if these log lines are being stuffed into a single message in order to 'catch up' after sumologic-fluentd is restarted?

@bin3377
Contributor

bin3377 commented Jan 22, 2019

Hi @chris-vest, thanks for reporting this issue.
It sounds like the multiline detection is creating this issue. What is your MULTILINE_START_REGEXP setting, and do you have any sample log lines that failed to be parsed?

@chris-vest
Author

@bin3377

MULTILINE_START_REGEXP = /^\w{3} \d{1,2}, \d{4}/

Example log: https://pastebin.com/SPr2rxFZ

Just to clarify, this only seems to happen after we enable a new connector. That causes Connect to log a huge number of lines very quickly, which seems to overload fluentd somehow and results in the massive, squashed log message ending up in Sumo Logic. Restarting the fluentd pods then fixes the issue.
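For context, my working assumption (I haven't inspected the config the chart actually generates from MULTILINE_START_REGEXP) is that it ends up in a fluent-plugin-concat filter along these lines, so any burst of lines that don't start with a "MMM d, yyyy" timestamp just keeps getting appended to the current record:

    # Hypothetical wiring of MULTILINE_START_REGEXP; the tag pattern here is a guess.
    <filter containers.**>
      @type concat
      key log
      # a new record starts on lines like "Jan 14, 2019 ..."; anything else is appended
      multiline_start_regexp /^\w{3} \d{1,2}, \d{4}/
      separator "\n"
    </filter>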

@bin3377
Contributor

bin3377 commented Feb 13, 2019

I think there is still something not very clear.
After the lines are merged into one "huge" message, what does it look like? Is each line a valid JSON object ending with \n?
On the other side, we use the fluentd memory buffer by default when streaming records out to the Sumo backend, so there is an 8MB chunk size limitation. And the Sumo HTTP endpoint has a suggested payload size of 100KB to 1MB. So I would suggest never generating a single payload over 1MB.
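As a sketch only (adjust to however your deployment templates the sumologic output), the chunk size can be capped in the output plugin's buffer section:

    <match **>
      @type sumologic
      # ... endpoint and formatting options as in your existing config ...
      <buffer>
        # keep each chunk, and therefore each HTTP payload, at or under 1MB
        chunk_limit_size 1m
      </buffer>
    </match>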

@chris-vest
Author

So Kafka Connect is logging very frequently after a new Connector is configured, which ends up overloading fluentd. The resulting log in SumoLogic looks like this - https://pastebin.com/SPr2rxFZ

The initial logs from Kafka Connect are no different from the logs it normally creates; the only difference is the frequency of those logs.

@chris-vest
Author

I am still seeing this issue, this is what I'm getting from the sumologic-fluentd pods:

sumologic-fluentd-sumologic-fluentd-m8d54 sumologic-fluentd-sumologic-fluentd 2019-05-13 09:42:52 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferChunkOverflowError error="a 13726192bytes record is larger than buffer chunk limit size" location="/var/lib/gems/2.3.0/gems/fluentd-1.2.6/lib/fluent/plugin/buffer.rb:677:in `block in write_step_by_step'" tag="containers.mnt.log.containers.msk-connect-7c74548dcf-jrv7g_strimzi_msk-connect-c2c8a2ddf227310d7f3cf27ac8ff38c7188e025292e99c42a60d8a57c35e1434.log"

@chris-vest
Author

@bin3377 I've upgraded to 2.3.0 to see if this is mitigated at all.

Looking at the buffer settings, I'm a little confused as to where to put them when configuring fluentd. I know I need to set chunk_limit_size, but which file should that go in? sumo.out.conf?

@chris-vest
Author

chris-vest commented Jun 7, 2019

@bin3377 I've tried adding this:

        out.sumo.buffer.conf: |-
          <match **>
            @type sumologic
            log_key "log"
            endpoint xxxxxx
            verify_ssl true
            log_format "json"
            flush_interval 5s
            num_threads 1
            open_timeout 60
            add_timestamp true
            timestamp_key "timestamp"
            proxy_uri ""
            <buffer>
              chunk_limit_size 1m
              flush_thread_count 1
              flush_interval 5s
            </buffer>
          </match>

But I'm still seeing log lines merged into single, large log messages in Sumo Logic, and once this happens, fluentd stops sending messages to Sumo Logic.

@chris-vest
Author

I think one of your original comments was correct: when I trigger the event in my cluster and the application logs busily, the logs are no longer picked up by the regex. However, shouldn't setting chunk_limit_size mitigate fluentd crashing?
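Or maybe chunk_limit_size can't help here, because the merged lines form a single record that is already bigger than the chunk limit (that is what the BufferChunkOverflowError above says), so the record would need to be capped earlier, at the multiline/concat stage. A sketch of what I mean, assuming fluent-plugin-concat is what's doing the merging (the tag pattern and values are my guesses, not the chart's actual generated config):

    <filter containers.**>
      @type concat
      key log
      multiline_start_regexp /^\w{3} \d{1,2}, \d{4}/
      # flush a partially-assembled record after 5 seconds so a burst can't grow without bound
      flush_interval 5
    </filter>

fluent-plugin-concat also has a timeout_label option for routing the records flushed by that timeout, if I'm reading its docs right.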
