Unrecoverable memory corruption if db concurrency > 6 #542
Help would be appreciated because this issue is murdering us right now.
CC: @domenkozar
I haven't seen this one before, but I'd recommend capturing a core dump that can be inspected.
This should be fixed by NixOS/nix@24b7398. |
@edolstra thank you for the quick response! I will apply that patch to our configuration and test it today. For posterity, here is more context about our configuration. We are pinned to the following version of nixpkgs:

```json
{
  "url": "https://github.com/NixOS/nixpkgs.git",
  "rev": "74286ec9e76be7cd00c4247b9acb430c4bd9f1ce",
  "date": "2018-01-15T12:35:29-05:00",
  "sha256": "13ydgpzl5nix4gc358iy9zjd5nrrpbpwpxmfhis4aai2zmkja3ak",
  "fetchSubmodules": true
}
```

This is the version of Hydra we are running:

We have a few patches applied to the Hydra derivation: one for building pull requests off of our Enterprise GitHub instance, and one fixing a

Our Hydra configuration is:
Our PostgreSQL configuration (some of which is cribbed from https://github.com/NixOS/nixos-org-configurations/blob/master/delft/chef.nix) is:
The Nix configuration has

The host is a bare-metal machine with 251GB of high-speed RAM, two 500GB SSDs, and one 500GB NVMe SSD. The host's CPU hardware is:
We have four build slaves (one of which is the Hydra host itself) that Hydra is configured to use as well; here is the machines' configuration:
I tried to debug the

I will test this patch by not running

I will report back to this issue ticket if we see
I can confirm that

Additionally, another issue we were observing, but didn't think was related and have not seen at all today, was

More time will tell, but I think @edolstra's fix was the ticket.
It was holding on to a Value* (i.e. a std::shared_ptr&lt;ValidPathInfo&gt;*) outside of the pathInfoCache lock, so the std::shared_ptr could be destroyed between the release of the lock and the decrement of the std::shared_ptr refcount. This can happen if more than 'path-info-cache-size' paths are added in the meantime, *or* if clearPathInfoCache() is called. The hydra-queue-runner queue monitor thread periodically calls the latter, so it is likely to trigger a crash. Fixes NixOS/hydra#542. (cherry picked from commit 24b7398)
We're having a fairly serious issue with our Hydra deployment: occasionally (we haven't tracked down what tends to cause this), a double free or other memory-related crash occurs that subsequent restarts of the hydra-queue-runner can't recover from (i.e. they crash immediately upon start):

The only way we've been able to recover from this error is by reducing max_db_concurrency to 6, letting Hydra chew through builds for a while, then increasing max_db_concurrency again.
We can't leave the max_db_concurrency at 6 because Hydra can't otherwise keep up with build volume at all and ends up taking many hours to schedule new builds being added by the evaluator.
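As a sketch, the temporary workaround above amounts to setting something like the following in hydra.conf (option spelling taken from the text above; treat it as an assumption, not verified against the Hydra manual), then raising the value again once the queue drains:

```
# Temporarily throttle the queue runner's DB concurrency so Hydra can
# chew through the backlog, then restore the normal (higher) value.
max_db_concurrency = 6
```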
Has anyone else encountered this before? Is it a PostgreSQL configuration issue?
@edolstra could you weigh in? I've asked in IRC and no one else has any insight into these issues.