nix copy uses too much memory #1681
Comments
Intuitively, it feels like it should be possible for it to run in constant memory. What am I missing?
I'm encountering this issue with a single path —
Continuation of 97002b6. This makes the daemon use constant memory. For example, it reduces the daemon's maximum RSS on $ nix copy --from ~/my-nix --to daemon /nix/store/1n7x0yv8vq6zi90hfmian84vdhd04bgp-blender-2.79a from 264 MiB to 7 MiB. We now use a TunnelSource to prevent the connection from ending up in an undefined state if an exception is thrown while the NAR is being sent. Issue NixOS#1681.
This reduces memory consumption of nix copy --from file://... --to ~/my-nix /nix/store/95cwv4q54dc6giaqv6q6p4r02ia2km35-blender-2.79 from 514 MiB to 18 MiB for an uncompressed binary cache, and from 192 MiB to 53 MiB for a bzipped binary cache. It may also be faster because fetching can happen concurrently with decompression/writing. Continuation of 48662d1. Issue NixOS#1681.
This reduces memory consumption of nix copy --from https://cache.nixos.org --to ~/my-nix /nix/store/95cwv4q54dc6giaqv6q6p4r02ia2km35-blender-2.79 from 176 MiB to 82 MiB. (The remaining memory is probably due to xz decompression overhead.) Issue NixOS#1681. Issue NixOS#1969.
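For anyone who wants to reproduce these numbers, a simple way to measure the client-side peak is GNU time's verbose mode. A minimal sketch — it assumes GNU time is installed at /usr/bin/time, reuses the store path from the commit messages above as an example, and only measures the nix copy client, not the daemon:

```
# GNU time's -v flag reports "Maximum resident set size" (in KiB),
# i.e. the peak RSS of the nix copy client process.
/usr/bin/time -v nix copy --from https://cache.nixos.org --to ~/my-nix \
  /nix/store/95cwv4q54dc6giaqv6q6p4r02ia2km35-blender-2.79
```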
I see commits purporting to address this for a number of different cases, but none concerning uploading to an S3 bucket. Trying to copy a 2.8 GB store path to an S3 bucket took nearly 4 GB of memory and more than twenty minutes of 100% CPU. Has that been fixed?
Hitting this issue trying to do something like: nixos-rebuild build; nix copy ./result --to ssh://low_ram_machine. @dtzWill, will those experimental changes help with ssh copy?
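One thing that may help the ssh case in the meantime, as a hedged suggestion: if the paths are also available from a binary cache the target machine can reach, nix copy's --substitute-on-destination flag lets the destination fetch them itself instead of receiving the NARs over the SSH connection:

```
# -s / --substitute-on-destination: the target substitutes the paths
# from its own configured substituters (e.g. cache.nixos.org) where
# possible, instead of having them streamed over SSH.
# "low_ram_machine" is the placeholder host from the comment above.
nix copy --substitute-on-destination --to ssh://low_ram_machine ./result
```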
@Ralith I'm probably not going to make S3BinaryCacheStore do uploads in constant space. It might not even be supported by aws-sdk-cpp. I assume the 100% CPU is caused by compression, which you can disable.
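For reference, compression is a setting of the binary cache store, so it can be switched off in the destination store URI. A minimal sketch — the bucket name is hypothetical, and the compression parameter is the binary-cache-store setting whose default is xz:

```
# compression=none uploads the NARs uncompressed: far less CPU during
# nix copy, at the cost of more S3 storage and transfer.
nix copy --to 's3://my-bucket?compression=none' \
  /nix/store/95cwv4q54dc6giaqv6q6p4r02ia2km35-blender-2.79
```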
FWIW, I too am another big-upload-to-S3 guy. It would surprise me if aws-sdk-cpp didn't support it, given that S3 supports almost arbitrarily large objects and multi-part uploads. If someone figured out how to implement it, would you accept the PR?
It seems very strange that it would take twenty minutes on my i7-4980HQ, even so. 2.8 GB is big, but it's not that big.
IIRC xz compression can easily take that long.
This is what I am seeing too:
The main problem I see is that it merely says "out of memory", instead of saying in the error message how much it tried to allocate and how much was available before the allocation. Copying data should run in constant space, as others have already mentioned. If the compression is causing higher memory requirements than needed, this is a problem too, because it raises hosting costs for no reason other than the initial deployment. Before the deployment, at least 300 MB was available on the host.
FWIW it looks like they do support streaming, at least for fetches: https://sdk.amazonaws.com/cpp/api/LATEST/index.html (near the end, look for IOStreams). Hopefully upload has something similar. Seconded re: xz compression taking that long. There's an option somewhere to enable parallel xz compression if you have idle cores; IIRC the result will be slightly bigger for the same compression level. Anyway, if someone tackled the API spelunking, would it be welcome? Or is there a reason it would have problems or is a bad idea? EDIT: oops, I think we already use the stream thing, although at a glance it looks like we pull it all into a string, but that seems resolvable. Anyway, fetching from S3 is probably not as important.
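The option alluded to above is, if I recall correctly, the S3 binary cache store's parallel-compression parameter (hedged: the parameter name is from memory, so check the store parameter list for your Nix version). It enables multi-threaded xz, which is faster on idle cores but yields a slightly larger result at the same level:

```
# Multi-threaded xz compression for uploads; the bucket name is a
# placeholder. Expect a small size penalty for the same xz level.
nix copy --to 's3://my-bucket?parallel-compression=true' \
  /nix/store/95cwv4q54dc6giaqv6q6p4r02ia2km35-blender-2.79
```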
As far as I can tell, the fixes in 2.0.1 still don't really fix the issue.
@lheckemann IIRC we didn't cherry-pick any memory improvements in 2.0.1. You need master for some of the fixes or my experimental branch for the rest.
Oh, that would explain it! Any chance they could be included in a 2.0.2 release? There have been so many complaints about this issue on IRC, and I've run into it myself more times than I would like as well.
Does "nixops deploy" use this? I get out of memory during deploy, even though I have several gigabytes free (both on disk and in working memory), which is odd. Just wondering if this is addressed here or should be investigated further.
@SebastianCallh you are not specifying which machine goes out of memory, so I assume you don't know that it's the machine you are deploying to. The solution to this is to use 512 MB of swap. Perhaps I might commit some of my changes to fix this in an AWS environment when t2.nanos are being used, but only if there is interest in them from people with commit access.
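For completeness, setting up the suggested 512 MB of swap on the target machine is straightforward. A minimal sketch — the file path is a placeholder, and on NixOS the swapDevices option achieves the same declaratively:

```
# Create and enable a 512 MiB swap file so peak allocations during
# nix copy can spill to disk instead of triggering the OOM killer.
fallocate -l 512M /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
```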
@coretemp That was the machine I was referring to. The machine being deployed to has plenty of both disk and working memory to spare when the error occurs.
Please don't release too many of the recent memory fixes until we've fixed #2203. Apologies if the proposed changes don't depend on the bits that broke.
edolstra/nix@c94b4fc is controversial only because it raises the cost of cloud resources without a good reason in many cases. If it simply inspected the machine to check how much storage and/or memory is available, it could use that as a default. Another solution is to run until it goes out of memory and then try again with the optimization applied automatically; that way it would always work.

For the people who want to optimize the last bits of performance, you could add variables (like those that already exist, but probably with better names) to control this behavior. With this design, everyone would be happy. Similarly, you could have flags that optimize for deployment time (e.g. waste more cloud resources to optimize for developer time).

As a guiding principle, I would like to see it acknowledged that increasing cloud resource cost is weighed heavily in implementation decisions. In general, even if you don't implement exactly one of the suggestions above, it is likely possible to create something non-controversial. The problem with the existing patch is that the variable one can control is an implementation detail, not a high-level policy.
The OOM condition is rather hard to handle, as it depends on the host OS. Typically it will let you allocate too much and then invoke an OOM killer later, so you don't have the option to react to the condition nicely.
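To illustrate why: with the default Linux overcommit policy, allocations rarely fail up front, so the process gets killed later rather than seeing an error it could handle. Strict overcommit makes the failure observable at allocation time. A sketch only — this is a system-wide setting with side effects on other programs:

```
# vm.overcommit_memory=2 makes malloc/mmap fail once the commit limit
# is exceeded, instead of succeeding and being OOM-killed later.
sysctl -w vm.overcommit_memory=2
```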
Has this been fixed in Nix 2.1?
Why is this critical issue not being addressed?
@vaibhavsagar I think so.
@coretemp please behave with respect and avoid ad hominem attacks, as they do no good to anyone. Nix is provided for free and comes with zero obligations from its developers. If you'd like professional support, I'd recommend contacting some of the consulting companies: https://nixos.org/nixos/support.html I'm locking this issue as nothing good can come of this. If there's an issue with the recent fix, please open another issue describing the problem.
coretemp has since been banned, so we can unlock.
I have backported @edolstra's memory fixes to Nix 2.0.4...nh2:nh2-2.0.4-issue-1681-cherry-pick. Note this fixes the case where the machine that's running
I think this issue is solved in Nix 2.2, at least for my use cases (given that my RAM problems in nixops disappear with my backport, including #38808). But it would make sense to ask around among the subscribers to this issue whether anyone has observed any further memory problems. If not, we can probably close this. (There is still #2774, which says that 2.2 is used and which is relatively recent.) So, does anybody here still have memory problems with current Nix?
I have. |
Yes, on Nix 2.2.2 I'm still seeing several GB of memory usage when substituting large paths (e.g. GHC) from a cache (as part of a larger build). This is problematic for running Nixery on something like Cloud Run, where memory is hard-capped at 2 GB. I haven't yet tried this with 2.3 to see if it makes a difference, but it's on the to-do list. Edit: I won't be able to test this with 2.3 easily, as it no longer works in gVisor, even with my SQLite connection patch. Might get around to more advanced debugging during the weekend ...
I have observed this when copying a locally built output to an HTTP cache:
and have observed that the memory usage slowly but surely rises towards that number (and never goes down) as
I think what happens here is that

EDIT: nix version 2.3.1
I'm running nix copy in runInLinuxVM and notice that for any nontrivial closure, the VM will run out of memory during the copying process. I left it set at the default 512 megabytes. I could obviously increase the amount of memory the VM is given, but that doesn't scale for copying complex derivations with many dependencies.
I suggest adding an option to only load and copy the contents of the paths one at a time, or even better, a way to specify an upper bound on the memory to be used while copying.
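In the meantime, a hedged workaround sketch for the first suggestion (assuming the closure is already built locally as ./result, and that the destination URI is a placeholder): walk the closure and copy one store path per invocation, so each nix copy run buffers at most one path:

```
# Emit the closure of ./result, then copy the paths one at a time;
# already-copied dependencies are skipped, so peak memory is bounded
# by the largest single store path rather than the whole closure.
nix-store --query --requisites ./result | while read -r path; do
  nix copy --to ssh://low_ram_machine "$path"
done
```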