
Avoid opening too many jobs in the collector. Only parse so many at one time. #203

Closed · wants to merge 3 commits

Conversation
Conversation

@toddr (Collaborator) commented Nov 9, 2020

Most Unix systems limit the number of file handles any one process can have open at a time to 1024. When the collector gets behind, it may try to open too many files for completed jobs at once. Limit this with a constant; the remaining jobs will be picked up once the polled ones are processed.
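The approach, as a minimal Perl sketch (the constant name, job hash, and methods below are illustrative, not the actual Test2::Harness::Collector code):

use constant MAX_OPEN_JOBS => 300;    # assumed cap, comfortably under the 1024 fd limit

sub poll_jobs {
    my ($self, $queue) = @_;

    my $open = scalar keys %{$self->{jobs}};    # jobs whose files are already open
    for my $item ($queue->poll) {
        last if $open >= MAX_OPEN_JOBS;    # defer the rest to the next poll pass
        $self->start_job($item);           # opens this job's event/log handles
        $open++;
    }
}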

@toddr (Collaborator, Author) commented Nov 9, 2020

This corrects the problem identified in #201 of the collector holding too many file handles open at once.

@atoomic (Collaborator) left a comment

tested and approved :-)

@exodist (Member) left a comment

I want this configurable (you can add it to your .yath.rc) with no behavior change by default.

This would also require documentation on when/why someone would use the option (your situation).
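For illustration, assuming the option ends up spelled --max-open-jobs (the flag name that appears later in this thread) and is honored by the test command, a .yath.rc entry might look like:

[test]
--max-open-jobs=300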

lib/Test2/Harness/Collector.pm (outdated review thread, resolved)
@exodist (Member) commented Nov 9, 2020

I know ZipRecruiter used -j56 on a super beefy machine when I was there. I do think we ran into this limit and simply bumped the max open filehandles limit.

OK, I am fine with a default; since it is a number of jobs, we can stick with 200. But I want it configurable.

Also, I wonder if we can catch this exception and handle it gracefully... or, if nothing else, catch it, report that the user should probably lower this config option's value, and then die. That does not need to be part of this PR, though, if it is a lot of work.
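A hedged sketch of what catching that failure might look like; EMFILE is the errno for "Too many open files", and everything else here (the path, the message) is illustrative:

use Errno qw(EMFILE);

my $job_file = '/path/to/job/events.jsonl';    # placeholder path

open my $fh, '<', $job_file or do {
    # EMFILE means the process hit its open-file limit; point the user
    # at the config option rather than dying with a bare error.
    die "Out of file handles; consider lowering the max open jobs setting.\n"
        if $! == EMFILE;
    die "Could not open $job_file: $!\n";
};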

@toddr (Collaborator, Author) commented Nov 9, 2020

Concerning the -j56: I think some recent jitter in releases has somehow slowed down the collector. I was even wondering if we were accidentally not using Cpanel::JSON::XS. What we do know is that updating to the latest T2 caused this to happen where it did not previously.

@toddr (Collaborator, Author) commented Nov 9, 2020

We're totally ok with an option and documenting it. Updating the PR now.

@exodist (Member) commented Nov 9, 2020

OK, yeah: add the option and docs and I will merge this. Then, without a rush, we can look into speeding up the collector (or fixing whatever slowed it down).

@toddr (Collaborator, Author) commented Nov 9, 2020

@exodist: To add an option, am I going to have to create lib/App/Yath/Options/Collector.pm?

@exodist (Member) commented Nov 9, 2020

Hmm, yes, that may be necessary. Any command that starts a collector (currently test and run) would need to include those options, so they should be their own category/module. That's the kind of flexibility this system was made to allow, though, so it should not be hard to do. I think everything is likely to get passed to all the right places already, so it should not be a difficult change. Let me know if you get stuck.
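For reference, a rough sketch of the shape such a module could take using the App::Yath option DSL; the group/option syntax is modeled on how other App::Yath::Options modules are structured, and the default and description here are illustrative:

package App::Yath::Options::Collector;
use strict;
use warnings;

use App::Yath::Options;

option_group {prefix => 'collector', category => "Collector Options"} => sub {
    option max_open_jobs => (
        type        => 's',    # option takes a value
        default     => 300,
        description => "Maximum number of job files the collector will hold open at once",
    );
};

1;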

lib/App/Yath/Command/collector.pm (outdated review thread, resolved)
lib/App/Yath/Command/collector.pm (outdated review thread, resolved)
lib/Test2/Harness/Collector.pm (outdated review thread, resolved)
lib/Test2/Harness/Util/Queue.pm (review thread, resolved)
@toddr force-pushed the collector_fh branch 2 times, most recently from f0e7316 to 83c92da on November 10, 2020 at 00:50
@exodist (Member) commented Nov 10, 2020

@toddr @atoomic I was going to explain how I needed things to change, but I had the time, so I just added a commit instead. Now I need you guys to verify that my changes still fix your problem. The changes you made would have fixed it, but the option was not actually usable and the collector would have been broken in subtle ways. If any of my changes need explanation, please let me know.

If this does fix your issues I can merge and release.

@exodist exodist self-assigned this Nov 10, 2020
@exodist exodist requested a review from atoomic November 10, 2020 04:06
@toddr (Collaborator, Author) commented Nov 10, 2020

I can't look till the morning, but I am almost certain that 1000 will break it: processing each job requires multiple handles, so 1000 x 2 > 1024. It's really not important that the loop doesn't suck everything up each time, since it'll get the rest on the next go-around.

@exodist (Member) commented Nov 10, 2020

Yeah, please do wait until you have time to fully read it. The 1000 you commented on is a different setting that has no effect on the number of open handles; it is just a second option I chose to expose, and its value is unchanged. I kept the 300 default that you had in your PR commit.
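To make the distinction concrete, a sketch of such an event-poll cap (the constant name and method names here are assumptions): it bounds how many events are read per pass from an already-open job, so it never changes how many handles are open:

use constant MAX_POLL_EVENTS => 1000;    # caps events per pass, not open handles

for my $job (values %$jobs) {
    my $count = 0;
    while (my $event = $job->next_event) {    # assumed accessor
        $self->handle_event($event);          # assumed handler
        last if ++$count >= MAX_POLL_EVENTS;  # yield to the next job
    }
}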

@toddr (Collaborator, Author) commented Nov 10, 2020

Thanks for your help on this.

@toddr (Collaborator, Author) left a comment

The changes look sane except for how and when it messages when it's backed up. We should probably recommend reducing the -j value, since setting it higher than the CPU count can cause load, which is pretty much the root cause of this problem. This also means that -j can never exceed 300 without significant complications.

my $max_open_jobs = $self->settings->collector->max_open_jobs // 1024;
my $additional_jobs_to_parse = $max_open_jobs - keys %$jobs;
if ($additional_jobs_to_parse <= 0) {
    $self->send_backed_up;
}
@toddr (Collaborator, Author) commented:

This should never happen. If it does, it means an entire processing loop happened and no jobs were processed to completion. Given there are 300 jobs, that would be bad, or you passed a REALLY big number to -j.

@exodist (Member) commented:

It happened in my testing when I set -j to 2 and set --max-open-jobs to 1. Not everyone will be using the defaults.

@toddr (Collaborator, Author) commented:

Sure, but I would argue it's a bug to set --max-open-jobs to less than -j; maybe we should warn on that?
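A hypothetical version of that warning (the settings accessors are assumptions, not the verified Test2::Harness API):

my $job_count = $settings->runner->job_count;         # the -j value (assumed accessor)
my $max_open  = $settings->collector->max_open_jobs;  # --max-open-jobs (assumed accessor)

warn "--max-open-jobs ($max_open) is less than -j ($job_count); "
   . "the collector will always report being backed up.\n"
    if defined $max_open && $max_open < $job_count;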

lib/Test2/Harness/Collector.pm (outdated review thread, resolved)
lib/Test2/Harness/Collector.pm (outdated review thread, resolved)
@toddr (Collaborator, Author) commented Nov 10, 2020

I've pushed a commit trying to straighten out the messaging. I'm going to try to test it now.

toddr and others added 2 commits November 10, 2020 09:15
Avoid opening too many jobs in the collector. Only parse so many at one time.

Most Unix systems limit the number of file handles any one process can
have open at a time to 1024. When the collector gets behind, it may try
to open too many files for completed jobs at once. Limit this with a
new collector setting that defaults to 300; the remaining jobs will be
picked up once the polled ones are processed.

This commit adds App::Yath::Options::Collector
 * Added another collector option
 * Renamed collector option
 * Put option in test and start commands
 * Make warning when job limit is hit a proper event
 * Only issue the warning once per test run
 * Use the correct settings file in collector
@toddr (Collaborator, Author) commented Nov 12, 2020

We're testing this right now on our systems. Surprisingly, even with -j$nproc, the collector runs behind for pretty much the whole run, until it reaches the end where the single-threaded tests run.

@exodist (Member) commented Nov 18, 2021

This has been merged.

@exodist exodist closed this Nov 18, 2021