
Parallel logging interference (I assume) #379

Closed · aktau opened this issue Jun 21, 2014 · 20 comments · Fixed by #384 · Label: Bug

aktau commented Jun 21, 2014

Hey! I've been noticing something lately:

[2014-06-21 01:57:18] app.DEBUG: [/next_item] requested item (19803) with duration (8.0s) will start playing in (50.0s), so we expect communication by 2014-06-21 01:58:21 [] []
[2014-06-21 01:57:22] app.DEBUG: [/next_item] requested item (18504) with duration (10.0s) will start playing in (12.0s), so we expect[2014-06-21 02:15:56] app.DEBUG: [/next_item] requested item (17933) with duration (24.0s) will start playing in (30.0s), so we expect communication by 2014-06-21 02:16:55 [] []
[2014-06-21 02:15:56] app.DEBUG: [/next_item] requested item (18507) with duration (10.0s) will start playing in (10.0s), so we expect communication by 2014-06-21 02:16:21 [] []

As you can see, the log lines interfere, which of course also messes with my automated log parsing.

This is all from exactly 1 webapp that's firing requests at the server; some of those requests may indeed be spaced very closely in time. I reckon the problem might be worse with a lot of concurrent web users. I have 2 PHP-FPM processes running.

What information could I provide to help diagnose the issue?

stof (Contributor) commented Jun 21, 2014

It looks like the end of one message has been completely discarded (including the final newline).

aktau (Author) commented Jun 21, 2014

Indeed it does, which makes it even stranger.

Seldaek (Owner) commented Jun 22, 2014

This is very weird assuming you use the StreamHandler or a derivative, because it writes to disk with a single fwrite() call to a file that has been opened in append mode, so the OS IO layer should take care of doing this atomically and not mix things up like this. Is it something that occurs regularly?
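For reference, the write path being described boils down to something like the following (a simplified sketch for illustration, not Monolog's actual source; the file name and helper function are made up):

<?php

// Simplified sketch (not Monolog's actual source) of the write path
// described above: the file is opened once in append mode, and each
// record is emitted with a single fwrite(), relying on the OS's
// append semantics to keep records intact.
$stream = fopen('/tmp/app.log', 'a');

function writeRecord($stream, $formatted)
{
    // One fwrite() per record; no locking involved.
    fwrite($stream, $formatted);
}

writeRecord($stream, "[2014-06-21 01:57:18] app.DEBUG: example record\n");
fclose($stream);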

Seldaek added the Bug label on Jun 22, 2014

stof (Contributor) commented Jun 22, 2014

@Seldaek it is not mixed. It is only that one of the log messages is incomplete; the end is missing (including the newline char).

Seldaek (Owner) commented Jun 22, 2014

Yup, true, but it still sounds like an OS fail, not something we can help with.

aktau (Author) commented Jun 22, 2014

I'm on Debian 7.0 (with a vanilla Linux 3.15.1 kernel, ext4), by the way. I also find it strange that (seemingly) part of the message is discarded; extremely curious.

Such a bug in the fs driver would have been noticed, I bet.

Seldaek (Owner) commented Jun 23, 2014

One would hope so, yes, but then again bugs that occur rarely under highly concurrent load can be hard to spot. Anyway, I don't really want to blame anyone here, but I just don't see what I can do about it, unfortunately, given the information at hand and that the code is (AFAIK) correct.

bobrik (Contributor) commented Jun 26, 2014

So yeah, locking is needed for concurrent access.

staabm (Contributor) commented Jun 26, 2014

@bobrik the example you posted above looks shorter than the 4096 bytes that should then be written as an atomic operation..? Maybe the buffer on your server is smaller, though.

Seldaek (Owner) commented Jun 26, 2014

Yup. Also, locking is kinda impossible to do IMO. You don't want to throw an exception OR block a request until it can acquire a lock on the log file. And if you only release the lock at the end of the request, you end up with a concurrency limit of pretty much 1 request at a time. Quite the performance killer.

bobrik (Contributor) commented Jun 26, 2014

@Seldaek the concurrency limit is 1 fwrite at a time per log file, and a write is usually pretty fast. If it's slow, the problem is somewhere else.

Seldaek (Owner) commented Jun 26, 2014

Yeah, if you flock/fwrite/unlock for every write. But I assume that's a ton more IO to solve a problem that isn't exactly proven to exist with short line lengths (which is mostly what gets logged). I mean, in the 3 years that Monolog has been used in production on many sites, this is the first time someone has spotted this problem. I don't know if that's because people weren't looking or because it didn't happen, but I don't want to penalize performance and "overreact" here if there is no issue. Doing optional locking would for sure be an option (and a good thing to have), though.

bobrik (Contributor) commented Jun 26, 2014

Just measured:

<?php

// Time a single locked write of a ~4KB line to a file opened in append mode.
$fd = fopen('test.log', 'a');

$s = microtime(true);
if (flock($fd, LOCK_EX)) {
    fwrite($fd, str_repeat('wooohooo', 500) . "\n");
    flock($fd, LOCK_UN);
    // Elapsed time, printed in milliseconds.
    echo sprintf("%.10f\n", (microtime(true) - $s) * 1000);
}
fclose($fd);

Results (the script prints milliseconds, so each write takes only tens of microseconds, not even a full millisecond!):

% for i in `seq 1 10`; do php -f test.php; done
0.0488758087
0.0460147858
0.0450611115
0.0429153442
0.0438690186
0.0438690186
0.0679492950
0.0500679016
0.0429153442
0.0460147858

without flock:

% for i in `seq 1 10`; do php -f test.php; done
0.0441074371
0.0448226929
0.0450611115
0.0450611115
0.0460147858
0.0450611115
0.0460147858
0.0460147858
0.0441074371
0.0460147858

Doesn't look like a performance killer to me. For scale, light travels about 200 meters in optical fiber in 1 microsecond.

Seldaek (Owner) commented Jun 26, 2014

OK, that looks acceptable indeed. I'd like to test some more, but thanks for the quick test!

aktau (Author) commented Jun 26, 2014

It's possible that this has something to do with newer kernel versions (3.14+) whose fs drivers use more concurrency, as I had never seen this before.

staabm (Contributor) commented Jun 26, 2014

@bobrik I think the performance difference gets bigger with more concurrency and when the system is under heavy (IO) load.

bobrik (Contributor) commented Jun 26, 2014

@staabm I'd love to see evidence of that. Why should Linux make it slow when it could be fast? I don't see any reason.

I did some more testing just to convince you.

Write 1000000 messages to file with 1 process with flock:

23741.071701ms
4001000000 bytes
1f340a25f1afb8bb538785c088bb485b

Write 100000 messages from each of 10 concurrent processes with flock:

22741.5876389ms
22776.4406204ms
23544.3735123ms
23877.5174618ms
24117.3708439ms
24102.0817757ms
24263.3070946ms
24295.6175804ms
24387.7398968ms
24602.2770405ms - this is the last process
4001000000 bytes
1f340a25f1afb8bb538785c088bb485b

Write 1000000 messages to file with 1 process without locking:

23303.2646179ms
4001000000 bytes
1f340a25f1afb8bb538785c088bb485b

Write 100000 messages from each of 10 concurrent processes without locking:

23038.921833ms
23169.2922115ms
23385.9410286ms
24201.3356686ms
24267.1320438ms
24330.3422928ms
24429.9705029ms
24827.4047375ms
24915.2333736ms
25041.1190987ms - this is the last process 
4001000000 bytes
1f340a25f1afb8bb538785c088bb485b

In all 4 cases the disk was 100% utilized. As you would expect, with 10 concurrent writers each fwrite was about 10x slower in both cases, but the total finish time stayed roughly the same across all tests.
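The per-process benchmark loop was roughly as follows (a reconstruction sketch, since the original script is not shown here; the 4001-byte message is inferred from the byte totals, and the file name test.log is assumed from the earlier test):

<?php

// Reconstruction sketch of the benchmark described above (assumed; the
// original script is not shown). Appends $n messages of 4001 bytes each
// (str_repeat('wooohooo', 500) . "\n" matches the 4001000000-byte total
// for 1000000 messages), with or without flock().
$n       = isset($argv[1]) ? (int) $argv[1] : 100000;
$useLock = isset($argv[2]) && $argv[2] === 'lock';

$fd  = fopen('test.log', 'a');
$msg = str_repeat('wooohooo', 500) . "\n"; // 4001 bytes

$start = microtime(true);
for ($i = 0; $i < $n; $i++) {
    if ($useLock) {
        flock($fd, LOCK_EX);
        fwrite($fd, $msg);
        flock($fd, LOCK_UN);
    } else {
        fwrite($fd, $msg);
    }
}
fclose($fd);

echo sprintf("%.7fms\n", (microtime(true) - $start) * 1000);

Ten concurrent writers would then be launched with something like for i in `seq 1 10`; do php bench.php 100000 lock & done; wait (bench.php being a hypothetical file name), after which the size and checksum of test.log can be compared across runs.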

Seldaek (Owner) commented Jun 28, 2014

@bobrik thanks! Would you or @aktau want to work on a PR for the StreamHandler's write method? The one question left is what to do when locking fails. I would tend towards ignoring failures and just calling fwrite anyway. It sounds a bit dirty, but considering all platforms and potential use cases, and since flock() blocks until a lock is acquired or it fails outright, I think ignoring failures is the right thing to do to keep this working everywhere.
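In code, the proposed behaviour would look roughly like this (a hypothetical sketch against the Monolog 1.x StreamHandler, not the actual patch that became #384):

<?php

use Monolog\Handler\StreamHandler;

// Hypothetical sketch (not the actual #384 patch): take a best-effort
// exclusive lock around the write and deliberately ignore flock()
// failures, so logging keeps working on platforms or filesystems
// where locking is unavailable.
class BestEffortLockingStreamHandler extends StreamHandler
{
    protected function write(array $record)
    {
        // flock() blocks until the lock is acquired or fails outright;
        // either way we write, since dropping the record would be worse.
        // (If the stream isn't open yet, the parent opens it and this
        // first write goes out unlocked.)
        $locked = is_resource($this->stream) && @flock($this->stream, LOCK_EX);

        parent::write($record);

        if ($locked) {
            @flock($this->stream, LOCK_UN);
        }
    }
}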

aktau (Author) commented Jun 28, 2014

Good research, guys. I have a bit too little time to go through the PR process at the moment; I'm too occupied with work and neovim.

Seldaek (Owner) commented Jul 28, 2014

@aktau You can now create a StreamHandler with the last constructor arg (useLocking) set to true to make it acquire a lock before every write.
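For example (assuming the Monolog 1.x signature, where the arguments are $stream, $level, $bubble, $filePermission, $useLocking):

<?php

use Monolog\Logger;
use Monolog\Handler\StreamHandler;

// The last constructor argument ($useLocking = true) makes the handler
// flock() the stream before each write.
$logger = new Logger('app');
$logger->pushHandler(new StreamHandler('/var/log/app.log', Logger::DEBUG, true, null, true));

$logger->debug('this record is written under an exclusive lock');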
