
Parallel logging interference (I assume) #379

Closed · aktau opened this issue Jun 21, 2014 · 20 comments · Fixed by #384 · Label: Bug

aktau commented Jun 21, 2014

Hey! I've been noticing something lately:

[2014-06-21 01:57:18] app.DEBUG: [/next_item] requested item (19803) with duration (8.0s) will start playing in (50.0s), so we expect communication by 2014-06-21 01:58:21 [] []
[2014-06-21 01:57:22] app.DEBUG: [/next_item] requested item (18504) with duration (10.0s) will start playing in (12.0s), so we expect[2014-06-21 02:15:56] app.DEBUG: [/next_item] requested item (17933) with duration (24.0s) will start playing in (30.0s), so we expect communication by 2014-06-21 02:16:55 [] []
[2014-06-21 02:15:56] app.DEBUG: [/next_item] requested item (18507) with duration (10.0s) will start playing in (10.0s), so we expect communication by 2014-06-21 02:16:21 [] []

As you can see, the log lines interfere, which of course also messes with my automated log parsing.

This is all from exactly 1 webapp that's firing requests at the server; some of those requests may indeed be spaced very closely in time. I reckon the problem might be worse with a lot of concurrent web users. I have 2 PHP-FPM processes running.

What information could I provide to help diagnose the issue?

stof (Contributor) commented Jun 21, 2014

It looks like the end of one message has been completely discarded (including the final newline).

aktau (Author) commented Jun 21, 2014

Indeed it does, which makes it even stranger.

Seldaek (Owner) commented Jun 22, 2014

This is very weird assuming you use the StreamHandler or a derivative, because it writes to disk with a single fwrite() call to a file that has been opened in append mode, so the OS IO layer should take care of doing this atomically and not mix things up like this. Is it something that occurs regularly?
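For reference, the write path being described boils down to something like the following (a simplified sketch for illustration, not Monolog's actual source; the file name and helper function are made up):

<?php

// Simplified sketch (not Monolog's actual source) of the write path
// described above: the file is opened once in append mode, and each
// record is emitted with a single fwrite(), relying on the OS's
// append semantics to keep records intact.
$stream = fopen('/tmp/app.log', 'a');

function writeRecord($stream, $formatted)
{
    // One fwrite() per record; no locking involved.
    fwrite($stream, $formatted);
}

writeRecord($stream, "[2014-06-21 01:57:18] app.DEBUG: example record\n");
fclose($stream);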

Seldaek added the Bug label on Jun 22, 2014

stof (Contributor) commented Jun 22, 2014

@Seldaek it is not mixed. It is only that one of the log messages is incomplete; the end is missing (including the newline char).

Seldaek (Owner) commented Jun 22, 2014

Yup, true, but it still sounds like an OS fail, not something we can help with.

aktau (Author) commented Jun 22, 2014

I'm on Debian 7.0 (with a vanilla Linux 3.15.1 kernel, ext4), by the way. I also find it strange that (seemingly) part of the message is discarded; extremely curious.

Such a bug in the fs driver would have been noticed, I bet.

Seldaek (Owner) commented Jun 23, 2014

One would hope so, yes, but then again bugs that occur rarely under highly concurrent load can be hard to spot. Anyway, I don't really want to blame anyone here, but I just don't see what I can do about it, unfortunately, given the information at hand and that the code is (AFAIK) correct.

bobrik (Contributor) commented Jun 26, 2014

So yeah, locking is needed for concurrent access.

staabm (Contributor) commented Jun 26, 2014

@bobrik the example you posted above looks shorter than the 4096 bytes that should then be written as an atomic operation..? Maybe the buffer on your server is smaller, though.

Seldaek (Owner) commented Jun 26, 2014

Yup. Also, locking is kinda impossible to do IMO. You don't want to throw an exception OR block a request until it can acquire a lock on the log file. And if you only release the lock at the end of the request, you end up with a concurrency limit of pretty much 1 request at a time. Quite the performance killer.

bobrik (Contributor) commented Jun 26, 2014

@Seldaek the concurrency limit is 1 fwrite at a time per log file, and a write is usually pretty fast. If it's slow, the problem is somewhere else.

Seldaek (Owner) commented Jun 26, 2014

Yeah, if you flock/fwrite/unlock for every write. But I assume that's a ton more IO to solve a problem that isn't exactly proven to exist with short line lengths (which is mostly what gets logged). I mean, in the 3 years that Monolog has been used in production on many sites, this is the first time someone has spotted this problem. I don't know if that's because people weren't looking or because it didn't happen, but I don't want to penalize performance and "overreact" here if there is no issue. Doing optional locking would for sure be an option (and a good thing to have), though.

bobrik (Contributor) commented Jun 26, 2014

Just measured:

<?php

// Time a single locked write of a ~4KB line to a file opened in append mode.
$fd = fopen('test.log', 'a');

$s = microtime(true);
if (flock($fd, LOCK_EX)) {
    fwrite($fd, str_repeat('wooohooo', 500) . "\n");
    flock($fd, LOCK_UN);
    // Elapsed time, printed in milliseconds.
    echo sprintf("%.10f\n", (microtime(true) - $s) * 1000);
}
fclose($fd);

Results (the script prints milliseconds, so each write takes only tens of microseconds, not even a full millisecond!):

% for i in `seq 1 10`; do php -f test.php; done
0.0488758087
0.0460147858
0.0450611115
0.0429153442
0.0438690186
0.0438690186
0.0679492950
0.0500679016
0.0429153442
0.0460147858

without flock:

% for i in `seq 1 10`; do php -f test.php; done
0.0441074371
0.0448226929
0.0450611115
0.0450611115
0.0460147858
0.0450611115
0.0460147858
0.0460147858
0.0441074371
0.0460147858

Doesn't look like a performance killer to me. For scale, light travels about 200 meters in optical fiber in 1 microsecond.

Seldaek (Owner) commented Jun 26, 2014

OK, that looks acceptable indeed. I'd like to test some more, but thanks for the quick test!

aktau (Author) commented Jun 26, 2014

It's possible that this has something to do with newer kernel versions (3.14+) whose fs drivers use more concurrency, as I had never seen this before.

staabm (Contributor) commented Jun 26, 2014

@bobrik I think the performance difference gets bigger with more concurrency and when the system is under heavy (IO) load.

bobrik (Contributor) commented Jun 26, 2014

@staabm I'd love to see evidence of that. Why should Linux make it slow when it could be fast? I don't see any reason.

I did some more testing just to convince you.

Write 1000000 messages to file with 1 process with flock:

23741.071701ms
4001000000 bytes
1f340a25f1afb8bb538785c088bb485b

Write 100000 messages from each of 10 concurrent processes with flock:

22741.5876389ms
22776.4406204ms
23544.3735123ms
23877.5174618ms
24117.3708439ms
24102.0817757ms
24263.3070946ms
24295.6175804ms
24387.7398968ms
24602.2770405ms - this is the last process
4001000000 bytes
1f340a25f1afb8bb538785c088bb485b

Write 1000000 messages to file with 1 process without locking:

23303.2646179ms
4001000000 bytes
1f340a25f1afb8bb538785c088bb485b

Write 100000 messages from each of 10 concurrent processes without locking:

23038.921833ms
23169.2922115ms
23385.9410286ms
24201.3356686ms
24267.1320438ms
24330.3422928ms
24429.9705029ms
24827.4047375ms
24915.2333736ms
25041.1190987ms - this is the last process 
4001000000 bytes
1f340a25f1afb8bb538785c088bb485b

In all 4 cases the disk was 100% utilized. As you would expect, with 10 concurrent writers each fwrite was about 10x slower in both cases, but the total finish time stayed roughly the same across all tests.
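The per-process benchmark loop was roughly as follows (a reconstruction sketch, since the original script is not shown here; the 4001-byte message is inferred from the byte totals, and the file name test.log is assumed from the earlier test):

<?php

// Reconstruction sketch of the benchmark described above (assumed; the
// original script is not shown). Appends $n messages of 4001 bytes each
// (str_repeat('wooohooo', 500) . "\n" matches the 4001000000-byte total
// for 1000000 messages), with or without flock().
$n       = isset($argv[1]) ? (int) $argv[1] : 100000;
$useLock = isset($argv[2]) && $argv[2] === 'lock';

$fd  = fopen('test.log', 'a');
$msg = str_repeat('wooohooo', 500) . "\n"; // 4001 bytes

$start = microtime(true);
for ($i = 0; $i < $n; $i++) {
    if ($useLock) {
        flock($fd, LOCK_EX);
        fwrite($fd, $msg);
        flock($fd, LOCK_UN);
    } else {
        fwrite($fd, $msg);
    }
}
fclose($fd);

echo sprintf("%.7fms\n", (microtime(true) - $start) * 1000);

Ten concurrent writers would then be launched with something like for i in `seq 1 10`; do php bench.php 100000 lock & done; wait (bench.php being a hypothetical file name), after which the size and checksum of test.log can be compared across runs.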

Seldaek (Owner) commented Jun 28, 2014

@bobrik thanks! Would you or @aktau want to work on a PR for the StreamHandler's write method? The one question left is what to do when locking fails. I would tend towards ignoring failures and just calling fwrite anyway. It sounds a bit dirty, but considering all platforms and potential use cases, and since flock() blocks until a lock is acquired or it fails outright, I think ignoring failures is the right thing to do to keep this working everywhere.
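In code, the proposed behaviour would look roughly like this (a hypothetical sketch against the Monolog 1.x StreamHandler, not the actual patch that became #384):

<?php

use Monolog\Handler\StreamHandler;

// Hypothetical sketch (not the actual #384 patch): take a best-effort
// exclusive lock around the write and deliberately ignore flock()
// failures, so logging keeps working on platforms or filesystems
// where locking is unavailable.
class BestEffortLockingStreamHandler extends StreamHandler
{
    protected function write(array $record)
    {
        // flock() blocks until the lock is acquired or fails outright;
        // either way we write, since dropping the record would be worse.
        // (If the stream isn't open yet, the parent opens it and this
        // first write goes out unlocked.)
        $locked = is_resource($this->stream) && @flock($this->stream, LOCK_EX);

        parent::write($record);

        if ($locked) {
            @flock($this->stream, LOCK_UN);
        }
    }
}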

aktau (Author) commented Jun 28, 2014

Good research, guys. I have a bit too little time to go through the PR process at the moment; I'm too occupied with work and neovim.

Seldaek (Owner) commented Jul 28, 2014

@aktau You can now create a StreamHandler with the last constructor arg (useLocking) set to true to make it acquire a lock before every write.
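For example (assuming the Monolog 1.x signature, where the arguments are $stream, $level, $bubble, $filePermission, $useLocking):

<?php

use Monolog\Logger;
use Monolog\Handler\StreamHandler;

// The last constructor argument ($useLocking = true) makes the handler
// flock() the stream before each write.
$logger = new Logger('app');
$logger->pushHandler(new StreamHandler('/var/log/app.log', Logger::DEBUG, true, null, true));

$logger->debug('this record is written under an exclusive lock');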
