
fs/cache: fix parent not getting pinned when remote is a file & hasher: look for cached hash if passed hash unexpectedly blank #7668

Open: wants to merge 5 commits into base: master
Conversation

@nielash nielash (Collaborator) commented Mar 8, 2024

What is the purpose of this change?

Three related bug fixes identified on https://forum.rclone.org/t/hasher-with-gdrive-backend-does-not-return-sha1-sha256-for-old-files/44680/9?u=nielash

  1. fs/cache: fix parent not getting pinned when remote is a file
  2. hasher: fix error from trying to stop an already-stopped db
  3. hasher: look for cached hash if passed hash unexpectedly blank

2 and 3 are specific to hasher, but 1 might explain some previously unexplained behavior in other backends.

Before this change, when cache.GetFn was called on a file rather than a directory, two cache entries would be added (the file + its parent) but only one of them would get pinned if the caller then called Pin(f). This left the other one exposed to expiration if the ci.FsCacheExpireDuration was reached. This was problematic because both entries point to the same Fs, and if one entry expires while the other is pinned, the Shutdown method gets erroneously called on an Fs that is still in use.

An example of the problem showed up in the Hasher backend, which uses the Shutdown method to stop the bolt db used to store hashes. If a command was run on a Hasher file (ex. rclone md5sum --download hasher:somelargefile.zip) and hashing the file took longer than the --fs-cache-expire-duration (5m by default), the bolt db was stopped before the hashing operation completed, resulting in an error.

This change fixes the issue by ensuring that the Pin/Unpin functions also Pin/Unpin any aliases in lockstep, so that pinning a file also pins the cache entry for its parent.

Was the change discussed in an issue or in the forum before?

Checklist

  • I have read the contribution guidelines.
  • I have added tests for all changes in this PR if appropriate.
  • I have added documentation for the changes if appropriate.
  • All commit messages are in house style.
  • I'm done, this Pull Request is ready for review :-)

Before this change, Hasher would sometimes try to stop a bolt db that was
already stopped, resulting in an error. This change fixes the issue by checking
first whether the db is already stopped.

https://forum.rclone.org/t/hasher-with-gdrive-backend-does-not-return-sha1-sha256-for-old-files/44680/11?u=nielash
Before this change, Hasher did not check whether a "passed hash" (hashtype
natively supported by the wrapped backend) returned from a backend was blank,
and would sometimes return a blank hash to the caller even when a non-blank hash
was already stored in the db. This caused issues with, for example, Google
Drive, which has SHA1 / SHA256 hashes for some files but not others
(https://rclone.org/drive/#sha1-or-sha256-hashes-may-be-missing) and sometimes also
does not have hashes for very recently modified files.

After this change, Hasher will check if the received "passed hash" is
unexpectedly blank, and if so, it will continue to try other enabled methods,
such as retrieving a value from the database, or possibly regenerating it.

https://forum.rclone.org/t/hasher-with-gdrive-backend-does-not-return-sha1-sha256-for-old-files/44680/9?u=nielash
@nielash nielash marked this pull request as ready for review March 8, 2024 22:34
@ncw ncw (Member) left a comment

Before this change, when cache.GetFn was called on a file rather than a
directory, two cache entries would be added (the file + its parent)

I wonder if this is the actual bug - adding two entries.

The thinking behind caching two entries was that the backend instance returned when you call NewFs("remote:dir") and NewFs("remote:dir/file.txt") is the same.

However this is not true - the result of the Root() call is different for the two backends, so they should be treated as two backends not one.

So I think a better fix for this might be to remove the caching of the parent in the pointing to a file case.

What do you think?

PS I've cherry-picked the two hasher fixes for the release. This fix can go in the release if done in time or in a point release.

fs/cache/cache.go: review comment (outdated, resolved)
@nielash nielash (Collaborator, Author) commented Mar 9, 2024

So I think a better fix for this might be to remove the caching of the parent in the pointing to a file case.

What do you think?

I agree! Adding two entries never totally made sense to me.

@ncw ncw (Member) commented Mar 9, 2024

I agree! Adding two entries never totally made sense to me.

Do you want to have a go at that change? Then you can see if it fixes your hasher oddities.

@nielash nielash (Collaborator, Author) commented Mar 9, 2024

Do you want to have a go at that change? Then you can see if it fixes your hasher oddities.

Sure! Looking at it now.

@nielash nielash (Collaborator, Author) commented Mar 9, 2024

I am realizing a slight problem here, which is that Pin(f) relies on fs.ConfigString(f) to get the name, and fs.ConfigString(f) will return the same string for both remote:path and remote:path/file.txt. So if the only cache entry is remote:path/file.txt, Pin(f) will never find it. What do you think is the proper solution to that?

@nielash nielash (Collaborator, Author) commented Mar 9, 2024

the result of the Root() call is different for the two backends

Also, it looks to me like this used to be true, but was changed very recently in c69eb84

As a result, if all you have is f it seems like there's no way of determining which file it should point to...

@nielash nielash (Collaborator, Author) commented Mar 10, 2024

Here's an alternate fix -- see what you think. 6c808e6

It eliminates the duplicate entry, but the one it keeps is the parent rather than the file. This is because of what I mentioned above -- that after c69eb84 the f.Root() for remote:path and remote:path/file.txt are indistinguishable, so you can deduce a parent from a child but not the reverse.

This version should ensure that:

  1. only one entry is added to the cache (the file's parent, not the file).
  2. future lookups correctly find the entry regardless of whether they are called with the parent name or one of its children.
  3. fs.ErrorIsFile is returned when (and only when) fsString points to a file (preserving the fix from 8d5bc7f).

If we wanted to do the reverse and cache the file but not its parent, I think what we'd probably have to do is either revert c69eb84 and have f.Root() return a different path, or add a new required method to fs.Fs which returns the file remote path or "" if not a file. Additionally we'd have to double-check that fs.ConfigString(f) returns something different for files vs. dirs. Could potentially be useful for other reasons, but I'm not sure it's necessary here, as there's a less invasive workable solution.

@ncw ncw (Member) commented Mar 13, 2024

Here's an alternate fix -- see what you think. 6c808e6

Wow, that is quite complicated! Probably too complicated for a point release - I'm really allergic to breaking stuff in a point release!

It eliminates the duplicate entry, but the one it keeps is the parent rather than the file. This is because of what I mentioned above -- that after c69eb84 the f.Root() for remote:path and remote:path/file.txt are indistinguishable, so you can deduce a parent from a child but not the reverse.

Note that commit is just one of many fixing the same problem in lots of backends!

This version should ensure that:

  1. only one entry is added to the cache (the file's parent, not the file).
  2. future lookups correctly find the entry regardless of whether they are called with the parent name or one of its children.
  3. fs.ErrorIsFile is returned when (and only when) fsString points to a file (preserving the fix from 8d5bc7f).

I think those seem correct.

Can we do it more simply?

One thing I was thinking is that if we changed the contents of the cache from fs.Fs to something like

struct {
    f fs.Fs
    files []string
}

Then when inserting a ErrorIsFile we'd just insert the parent but add an entry to files, and when looking up we'd look for the fsString as-is then look it up without the last path entry. If a member of files matches then we could return the f with ErrorIsFile.

If we wanted to do the reverse and cache the file but not its parent, I think what we'd probably have to do is either revert c69eb84 and have f.Root() return a different path, or add a new required method to fs.Fs which returns the file remote path or "" if not a file. Additionally we'd have to double-check that fs.ConfigString(f) returns something different for files vs. dirs. Could potentially be useful for other reasons, but I'm not sure it's necessary here, as there's a less invasive workable solution.

I don't want to undo c69eb84 as it fixed a small but important data loss bug.

The whole returning fs.ErrorIsFile is not very pleasant, but I haven't thought of anything better! One idea I had was that you always open a backend at the root with NewFs then you Chroot into the directory that you want, but that isn't much better.

@nielash nielash (Collaborator, Author) commented Mar 13, 2024

I'm really allergic to breaking stuff in a point release!

I would argue that it is already broken and this patch un-breaks it 😄 But I hear you!

It seems plausible that the same issue could be breaking copyto and moveto operations that take longer than 5 minutes, for backends with a Shutdown method. (I haven't tested this)

Can we do it more simply?

My first version (c4ac032) is still an option. It's much simpler. (Leave the duplicate but Pin/Unpin them in lockstep)

One thing I was thinking is that if we changed the contents of the cache from fs.Fs to something like

struct {
    f fs.Fs
    files []string
}

Then when inserting a ErrorIsFile we'd just insert the parent but add an entry to files, and when looking up we'd look for the fsString as-is then look it up without the last path entry. If a member of files matches then we could return the f with ErrorIsFile.

That's very similar to what my newest version is doing. I think it would work -- but it might mean having to call NewFs more often, because we'd lose all the files []string knowledge every time the cache expires.

I don't want to undo c69eb84 as it fixed a small but important data loss bug.

I don't think we need to.

The whole returning fs.ErrorIsFile is not very pleasant, but I haven't thought of anything better! One idea I had was that you always open a backend at the root with NewFs then you Chroot into the directory that you want, but that isn't much better.

Yeah I don't have any perfect solutions either. One idea might be to have NewFs and cache.Get return an optional remote string param like cmd.NewFsFile, which can be "" if not pointing to a file. (Maybe also an fs.Directory to help with #7652)

My overall observation as someone still learning all of this is that paths are much easier to reason about on the CLI than in the code -- f.Root() does not always mean what I'd expect!
