eval cache: fix and make re-evaluation of cached errors configurable #10368

Ma27 · 2024-03-30T12:19:19Z

Motivation

I'm on Nix 2.19 and I wanted to fix the dreaded cached failure of attribute thing (again).
I realized that it's gone on 2.20+, but it turns out the eval cache just regressed and doesn't seem to cache anything. Fix for that is in the first commit.
The second commit adds an eval setting to configure whether or not to force re-evaluation of cached failures.

Future work:

set this to false by default for a few commands such as nix search.
cache error messages (and traces?) as well.

Context

I'd recommend to review commit-wise.

Priorities and Process

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

…ided In 6558da4 the way the cache key for the eval cache was computed was deferred to libfetchers. However, the git fetcher only provides a fingerprint if a rev is given which seems reasonable. That breaks the eval cache however because calls like `nix build .#` are uncached now because the input doesn't have a revision. Adding the old behavior back only if libfetchers doesn't return a fingerprint and add a regression test. Not sure btw why I'd need to run `nix build` three times when having `packages.$system.foo` fail on evaluation until it's cached, added a FIXME for now since that's something we'll probably want to solve in the future. Also interesting that if the attribute `foo` fails, it never gets cached currently 🤔

When I run `nix build` on a broken attribute in a flake twice, I get the dreaded error: cached failure of attribute 'snens' This also happens when you re-run the previous command with `--show-trace`. While trying to fix that, Linus and I discovered that the eval cache was broken for use on local flakes (i.e. `nix build .#`) was broken which got fixed in the commit before. We decided to use an eval setting for that because `forceError` seems to have been added somewhat arbitrarily as argument to the functions of AttrCursor where this was needed at the time it was developed. The effect was that it was hard for us to understand when an invocation can force errors and when it doesn't. For instance, `AttrCursor::isDerivation()` never forces re-evaluation of errors, so if `maybeGetAttr("type")` is the first thing that breaks eval, you'd always run into a cached failure. For now, this is `true` by default because I think the current behavior with a properly working eval cache is pretty painful while hacking on (or with) a flake. Potential future improvements are * Setting `evalSettings.forceErrorsInEvalCache` to `false` for commands like `nix search`. * Allowing the eval cache to cache the full failure (right now, `failed_t` is empty). That way, you could get errors when re-running `nix build` without having to re-evaluate everything, perhaps even with traced. That's the solution Linus and I would prefer. Co-authored-by: Linus Heckemann <git@sphalerite.org>

edolstra · 2024-04-02T09:43:31Z

Disabling caching for dirty trees was intentional (and part of the incremental lazy-trees backport). Computing a cache key for a dirty Git tree currently requires reading the entire workdir, so that's very not-lazy. (We could compute a cache key more cheaply from the Git revision and the contents of the dirty files, but that requires more work.)

Note that flake.path.to_string() is not a valid cache key, since for a dirty Git tree it will return something like /path/to/tree, which doesn't change if the contents of the tree change.

Regarding "cached failure of attribute": can you create an issue for that? Seeing that message should be considered a bug, since in case of a cached failure, we should always reevaluate to get the actual error.

roberth · 2024-04-02T11:59:54Z

Computing a cache key for a dirty Git tree currently requires reading the entire workdir,

Determining whether the tree is dirty already requires reading the entire workdir.
I don't know off the top of my head the details, but these are good starting points:

romkatv's work on optimizing and/or replacing "git status"
if something uses mtime, we could use the same info to produce "less perfect" but similarly valid cache keys

Furthermore, concurrent mutation could lead to incorrect cache entries if implemented naively:

It seems feasible to have cache keys that only involve the files that are actually read, although remembering which files may affect any given attribute is of course more complicated than a single key. (a lot more complicated, but also a lot more effective caching)
A simpler solution is to always "stage" the workdir, except to our own index instead of the normal git index. This gets rid of the fragile custom code paths that access the workdir directly.

We should probably do the latter anyway, so that git clean filters are applied, just as they would be by git when the files are committed.

Ericson2314 · 2024-04-02T15:41:01Z

We ought to also be able to cheaply get a git tree for staged data

edolstra · 2024-04-03T09:47:59Z

Determining whether the tree is dirty already requires reading the entire workdir.

It doesn't, git status is very fast because it relies on inodes' mtimes (or ctimes?) instead of reading file contents.

Ma27 · 2024-04-04T22:56:35Z

About the issue itself

can you create an issue for that? Seeing that message should be considered a bug, since in case of a cached failure, we should always reevaluate to get the actual error.

There's already #3872 and in fact I fixed the problem once already: #4688

The patch applies (short of the regression-test) cleanly on 2.19 and is the main motivation why I filed this PR, I wanted to get rid of this behavior. So we could actually backport to 2.19 & 2.18 (which is the stable minor in nixpkgs IIRC).

It probably exists on 2.20+ as well, however I didn't bother to check because it's rather hard to trigger the eval cache reliably / easy to hit issues (or I just have bad luck) and it was very hard for me to trigger the issue reliably. Upon investigating I found out that dirty trees are simply not cached anymore. But anyways, we'll probably need the second commit on 2.20+ as well.
And I should file an issue for the weird caching behavior with nested attributes.

I'd like to go into a direction where we cache errors again, but we also cache the error messages.

Whether or not to cache dirty trees

Note that flake.path.to_string() is not a valid cache key, since for a dirty Git tree it will return something like /path/to/tree, which doesn't change if the contents of the tree change.

Sure, it's part of a cache key with a few other things.

Computing a cache key for a dirty Git tree currently requires reading the entire workdir, so that's very not-lazy

Does it though? Which part of the old cache-key requires reading the entire tree?

It seems feasible to have cache keys that only involve the files that are actually read, although remembering which files may affect any given attribute is of course more complicated than a single key

That actually seems like an interesting follow-up task although it requires a major rework of the evaluation cache. Is this a change the Nix team would be interested in?
I'd strictly focus on solving the current issue first, but it may be a useful future improvement.

A simpler solution is to always "stage" the workdir, except to our own index instead of the normal git index

So my knowledge about the Nix codebase might've gotten a little dusty in the past year, but what does "our" own index" mean in this context?

roberth · 2024-04-05T11:17:07Z

It doesn't, git status is very fast because it relies on inodes' mtimes (or ctimes?) instead of reading file contents.

As I've said,

if something uses mtime, we could use the same info to produce "less perfect" but similarly valid cache keys

Ma27 and others added 2 commits March 30, 2024 13:13

Ma27 requested a review from edolstra as a code owner March 30, 2024 12:19

github-actions bot added the with-tests Issues related to testing. PRs with tests have some priority label Mar 30, 2024

Ma27 mentioned this pull request Apr 5, 2024

Nix flakes should not cache evaluation errors #3872

Closed

edolstra mentioned this pull request Apr 15, 2024

InstallableFlake::getCursors(): Force re-evaluation in case of cached failure #10513

Closed

Ma27 closed this Apr 26, 2024

Ma27 deleted the eval-cache-fixes branch April 26, 2024 21:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

eval cache: fix and make re-evaluation of cached errors configurable #10368

eval cache: fix and make re-evaluation of cached errors configurable #10368

Ma27 commented Mar 30, 2024

edolstra commented Apr 2, 2024

roberth commented Apr 2, 2024

Ericson2314 commented Apr 2, 2024

edolstra commented Apr 3, 2024

Ma27 commented Apr 4, 2024

roberth commented Apr 5, 2024

eval cache: fix and make re-evaluation of cached errors configurable #10368

eval cache: fix and make re-evaluation of cached errors configurable #10368

Conversation

Ma27 commented Mar 30, 2024

Motivation

Context

Priorities and Process

edolstra commented Apr 2, 2024

roberth commented Apr 2, 2024

Ericson2314 commented Apr 2, 2024

edolstra commented Apr 3, 2024

Ma27 commented Apr 4, 2024

About the issue itself

Whether or not to cache dirty trees

roberth commented Apr 5, 2024