
Too much log output seems to cause processes to freeze #57

Closed
john-rodewald opened this issue Jun 14, 2023 · 21 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

Comments


john-rodewald commented Jun 14, 2023

Hi! We use this tool in our team to orchestrate a web application on our development machines. It's been very pleasant so far.

One of the tools in our stack managed by process-compose is Hasura. Specifically, the process is a shell script that sets environment variables and then executes graphql-engine.

For the longest time, we've observed graphql-engine occasionally freezing up for no apparent reason. A process-compose restart is enough to get it to behave again. Today I discovered something new - the problem is:

  • Reliably reproducible by having graphql-engine output more logs (when running it through process-compose)
  • Impossible to reproduce with a lower graphql-engine verbosity (when running it through process-compose)
  • Impossible to reproduce when running graphql-engine outside of process-compose

One way that I can run graphql-engine through process-compose without any freezing is by redirecting stdout to a log file, i.e. adding 1>graphql-engine.log to the process shell script. Setting a log_location and modifying log_level in process-compose.yaml did not seem to fix freezing.

I can't say whether the sheer amount of logs is what causes process-compose to choke or if something else is going on. If there's an error log for process-compose itself, or any other information that might be useful, let me know and I'll try to provide it.

I'll expand this issue if I discover anything more.

john-rodewald changed the title to Too much console log output seems to cause processes to freeze Jun 14, 2023
john-rodewald changed the title back to Too much log output seems to cause processes to freeze Jun 14, 2023
F1bonacc1 (Owner) commented Jun 14, 2023

Hi @john-rodewald,

Thank you for trying Process Compose and for letting me know about its issues.

A few questions:

  1. Are you running in TUI mode or with the TUI disabled (-t=false)?
  2. What version of process-compose are you using?
  3. What log_length is configured in your process-compose.yaml?
  4. I'm not sure whether your logs contain ANSI colors, but if they do, did you try processes.process_name.disable_ansi_colors: true?

Also, if you can give me some numbers, I will try to reproduce it on my end:

  1. Log lines per minute.
  2. How long does it take before the freezing starts?

P.S.
The log_level setting only controls process-compose's own internal log.
You can run process-compose info to see the process-compose log location.
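
For reference, here is roughly how the settings mentioned above fit into a process-compose.yaml. This is only a sketch: the key names are the ones discussed in this thread, while the values and the script path are illustrative.

log_location: /tmp/process-compose-output.log   # root-level persistence of process output
log_length: 1000                                # log buffer length kept per process (lines)

processes:
  hasura:
    command: "./run-graphql-engine.sh"          # illustrative path to the wrapper script
    disable_ansi_colors: true                   # strip ANSI colors from this process's output
    log_location: /tmp/hasura.log               # per-process log file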

F1bonacc1 added the bug (Something isn't working) and good first issue (Good for newcomers) labels Jun 14, 2023
john-rodewald (Author) commented Jun 15, 2023

  1. Are you running in TUI mode or with the TUI disabled (-t=false)?

This is in TUI mode. I've tried with -t=false with no difference in behaviour.

  2. What version of process-compose are you using?

v0.45.0 at this time. The next thing I'll try is to update process-compose to v0.51.0.

  3. What log_length is configured in your process-compose.yaml?

None is configured at the moment. I now tried setting it to 10 and to 10000 with no difference in behaviour.

  4. I'm not sure whether your logs contain ANSI colors, but if they do, did you try processes.process_name.disable_ansi_colors: true?

Tried this with no difference in behaviour.

  1. Log lines per minute.
  2. How long does it take before the freezing starts?

There's a set of reproduction steps I can follow that reliably causes freezing. It seems that we're in the range of ~100 log lines per second before I see freezing. By the time I observe this peak, the process has already stopped responding.

Reducing verbosity brings me down to a peak of ~20 log lines per second. In this case, no freezing occurs.

I find these numbers to be fairly low, both in absolute and relative terms. 🤔

@F1bonacc1 (Owner)

This is very helpful.
This narrows it down to a completely different place than I suspected at first. And it should be easier to address.
I will try to reconstruct it on my end and optimize the log rotation mechanism.

Another question, this one is probably more specific to your business logic. How do you measure (or experience) the freezes?

@john-rodewald (Author)

I'm glad to hear! graphql-engine is a GraphQL server that talks to our database. This is a web application with an SPA frontend. My user story goes something like this:

  • I use the application in the browser and requests succeed as expected
  • Next, I follow a set of reproduction steps in a certain module
    • In this example, several entities are created and updated in quick succession, likely causing a lot of log output
  • The application now becomes unresponsive (requests don't succeed anymore, content is no longer loaded)
  • There is no additional log output in the TUI and all requests time out (504) until the process is restarted

We only use process-compose to help orchestrate local development environments and only ever see this problem in those local environments, never in live systems running the same GraphQL server setup.


dzmitry-lahoda commented Jun 15, 2023

This seems to be the case for me too.

Also reproduced: dumping a 5MB text file freezes it.

F1bonacc1 added a commit that referenced this issue Jun 15, 2023
F1bonacc1 (Owner) commented Jun 15, 2023

I created a small test environment that prints about 100K log lines/sec.

When it runs outside process-compose, it takes ~1.1s to print 100K lines:
I am long line number 2676253 ====================================== Duration per 100k: 0:00:01.108581

When inside process-compose, it takes ~1.3s to print 100K lines:
[bad_script_1 ] I am long line number 14839177 ====================================== Duration per 100k: 0:00:01.320320

It ran for a few minutes and stayed relatively stable (1.2s-1.4s per 100K lines) - no freezes.
Saving to a file or not (process_name.log_location set or undefined) didn't make any difference.

Would you give this test a try on your system?

P.S.
I tested with TUI disabled: process-compose -t=false
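
For reference, a rough Go sketch of the kind of generator described above (the actual script used for this test was a Python one, mentioned later in the thread; this is illustrative only):

// loadgen.go - illustrative sketch of the load generator described above.
// It prints numbered, padded lines as fast as possible and reports how long
// each batch of 100K lines took.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()

	pad := strings.Repeat("=", 38)
	start := time.Now()
	for i := 1; ; i++ {
		fmt.Fprintf(w, "I am long line number %d %s\n", i, pad)
		if i%100000 == 0 {
			fmt.Fprintf(w, "Duration per 100k: %v\n", time.Since(start))
			w.Flush()
			start = time.Now()
		}
	}
}

Running something like this as a process under process-compose and watching whether the reported duration per 100K lines stays stable is essentially the experiment described above.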


dzmitry-lahoda commented Jun 15, 2023

Not sure if that was a response to me, but did you try one big line of about 5 MB? Not many lines, but one big one?

@dzmitry-lahoda

I guess the TUI may fail on such a line.

@F1bonacc1 (Owner)

Not sure if that was a response to me, but did you try one big line of about 5 MB? Not many lines, but one big one?

I tried to cat a 1GB file: OK.
But when I replaced the \n characters in a 20MB file, turning it into one long line, it didn't go so well.
Do you think a single line that is 23040578 characters long is a realistic scenario?

I'm not sure it's the same issue @john-rodewald is facing.

@dzmitry-lahoda

20 MB, no, but 5 MB, yes. I run Wasm hosts, and they sometimes log base64-encoded Wasm blobs as single lines.

@F1bonacc1 (Owner)

@dzmitry-lahoda Do you mind opening a separate issue for this scenario?

john-rodewald (Author) commented Jun 16, 2023

I created a small test environment that prints about 100K log lines/sec.

Thanks for checking! I tried running this both with v0.45.0 and with v0.51.0 and everything is stable, as you say.

To rule out issues with the actual log content, I captured graphql-engine logs and printed a set of those in your Python script (essentially replacing the I am long line number... with actual graphql-engine output). Again, no issues at the same speed.

This is bizarre. 🤔

@F1bonacc1 (Owner)

This is bizarre indeed...

A few external factors to consider:

  1. Check your terminal's scrollback history size and make sure it's less than 10K lines.
  2. Comment out processes.hasura.log_location and the root-level log_location to make sure it's not a persistence delay issue.
  3. Also related to persistence (if enabled): make sure the log_location directory is on the exclusion list of any active protection software (antivirus and such).
  4. As @dzmitry-lahoda found, process-compose doesn't handle very long lines well. I still need to check what counts as too long and where things start to break, but in the meantime please check the hasura logs for cases of very long lines. We can compare notes after I write a proper repro 😃

@F1bonacc1 (Owner)

Hi,

@john-rodewald, can you please share any additional information you've discovered that could help reproduce the issue?
The scenario described by @dzmitry-lahoda was straightforward to reproduce. The limit is 2^16 bytes per line, the default value defined as MaxScanTokenSize in Go's bufio package. It's possible to reconfigure it externally to any other reasonable value.
I don't think it's related to your case, but I would prefer that you confirm it.

Another question: when you experience those freezes, does the logger stop entirely (until restarted) or just slow down for a while? For how long?
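
For illustration, a minimal standalone Go sketch (not process-compose code) of the limit described above: bufio.Scanner stops with a "token too long" error once a single line exceeds MaxScanTokenSize (64 KiB), unless a larger maximum buffer is set explicitly.

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// One "line" slightly larger than the default 64 KiB token limit.
	long := strings.Repeat("x", bufio.MaxScanTokenSize+1) + "\n"

	// Default scanner: scanning stops and Err() reports bufio.ErrTooLong.
	s := bufio.NewScanner(strings.NewReader(long))
	for s.Scan() {
	}
	fmt.Println("default scanner:", s.Err()) // bufio.Scanner: token too long

	// Scanner with an enlarged maximum buffer: the same line is read fine.
	s = bufio.NewScanner(strings.NewReader(long))
	s.Buffer(make([]byte, 1024), 1024*1024)
	for s.Scan() {
	}
	fmt.Println("enlarged scanner:", s.Err()) // <nil>
}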


appaquet commented Jul 6, 2023

I ran into this issue as well. One of my processes logs an excessively long line and then hangs writing to stdout. I had a lot of goroutines in my process stuck in the zap logger trying to write to stdout (under a lock, probably to prevent mangled lines).

I recompiled process-compose with pprof enabled and dumped the goroutines. At a glance, nothing seems to be deadlocked, but I do see a bunch of goroutines scanning lines (see https://gist.github.com/appaquet/d960f5f4bf7aae018971735c5f73b0a4), as they should be.

I then tried increasing the scanner's max token size, and it does fix my issue:

diff --git a/src/app/process.go b/src/app/process.go
index f595e9c..cf070aa 100644
--- a/src/app/process.go
+++ b/src/app/process.go
@@ -4,7 +4,6 @@ import (
 	"bufio"
 	"context"
 	"fmt"
-	"github.com/f1bonacc1/process-compose/src/types"
 	"io"
 	"math/rand"
 	"os"
@@ -14,6 +13,8 @@ import (
 	"syscall"
 	"time"
 
+	"github.com/f1bonacc1/process-compose/src/types"
+
 	"github.com/f1bonacc1/process-compose/src/command"
 	"github.com/f1bonacc1/process-compose/src/health"
 	"github.com/f1bonacc1/process-compose/src/pclog"
@@ -335,6 +336,7 @@ func (p *Process) handleInput(pipe io.WriteCloser) {
 
 func (p *Process) handleOutput(pipe io.ReadCloser, handler func(message string)) {
 	outscanner := bufio.NewScanner(pipe)
+	outscanner.Buffer(make([]byte, 1024), 1024*1024)
 	outscanner.Split(bufio.ScanLines)
 	for outscanner.Scan() {
 		handler(outscanner.Text())

Perhaps making this max line value configurable per process could be a fix.

@F1bonacc1 (Owner)

Hi @appaquet,

You are absolutely right; this is precisely the place that caused the hang for @dzmitry-lahoda. In one of the latest versions, I added an explicit error that is printed to the process-compose log.
I considered adding it as a configuration parameter, but it felt too low-level for configuration.
I can also increase the default value (2^16), but no matter what I do, at some point there will be someone for whom the default isn't high enough or for whom the new configuration parameter is too hidden.

Maybe it's time for a troubleshooting section in the README.

I am not sure, though, that this is the error @john-rodewald is facing.


nitrotm commented Dec 8, 2023

I can confirm that I have the same problem (some log lines are 10MB).

Why not raise it to a large but capped amount of RAM? I personally don't care if the buffer can grow up to 128MB in dev mode.

Alternatively, maybe rely on a fail-safe scanner which would yield a partial line if the buffer is exhausted?
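
A fail-safe scanner along those lines could be built with a custom bufio.SplitFunc that emits an over-long line as capped chunks instead of aborting the scan. A rough sketch (capLines is a hypothetical helper, not an existing process-compose or standard-library function):

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// capLines behaves like bufio.ScanLines, except that a line longer than max
// bytes is emitted as several chunks of at most max bytes instead of making
// the scan fail with ErrTooLong. max must not exceed the Scanner's buffer size.
func capLines(max int) bufio.SplitFunc {
	return func(data []byte, atEOF bool) (int, []byte, error) {
		if atEOF && len(data) == 0 {
			return 0, nil, nil
		}
		if i := bytes.IndexByte(data, '\n'); i >= 0 && i <= max {
			// A complete line fits under the cap: return it without the newline.
			return i + 1, bytes.TrimSuffix(data[:i], []byte("\r")), nil
		}
		if len(data) >= max {
			// No newline within max bytes: flush a partial chunk.
			return max, data[:max], nil
		}
		if atEOF {
			// Final line without a trailing newline.
			return len(data), data, nil
		}
		// Ask the Scanner for more data.
		return 0, nil, nil
	}
}

func main() {
	// Demo: a 200-byte "line" with a 64-byte cap comes out as 4 chunks (64+64+64+8).
	input := strings.Repeat("x", 200) + "\nshort line\n"
	sc := bufio.NewScanner(strings.NewReader(input))
	sc.Split(capLines(64))
	for sc.Scan() {
		fmt.Printf("chunk of %d bytes\n", len(sc.Bytes()))
	}
}

With something like this, the log handler would receive an over-long line in pieces rather than the whole output pipe stalling, at the cost of occasionally splitting a line mid-way.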


manveru commented Jan 20, 2024

I just hit the same issue with PostgreSQL while logging all SQL statements and inserting a 2.3MB blob. I thought I was going insane until I noticed that it wasn't my application or PostgreSQL, but process-compose itself that completely froze and had to be killed by force.
I really think that excessively long lines should be truncated; otherwise I might have to wrap each process it handles in a tmux session, at which point I might as well use tmux standalone anyway...

@dzmitry-lahoda

Hm, maybe there is some non-tmux tool for handling logs?
Maybe process-compose should just fail gracefully on a big log line and suggest using that tool. Unix philosophy.

By the way, can tmux do background sessions?

@F1bonacc1 (Owner)

Fixed. Will be part of the next release.

@F1bonacc1 (Owner)

Fixed in v0.81.4.
