
fs/cache: fix parent not getting pinned when remote is a file & hasher: look for cached hash if passed hash unexpectedly blank #7668

Open: wants to merge 5 commits into base: master
Conversation

@nielash nielash (Collaborator) commented Mar 8, 2024

What is the purpose of this change?

Three related bug fixes identified on https://forum.rclone.org/t/hasher-with-gdrive-backend-does-not-return-sha1-sha256-for-old-files/44680/9?u=nielash

  1. fs/cache: fix parent not getting pinned when remote is a file
  2. hasher: fix error from trying to stop an already-stopped db
  3. hasher: look for cached hash if passed hash unexpectedly blank

2 and 3 are specific to hasher, but 1 might explain some previously unexplained behavior in other backends.

Before this change, when cache.GetFn was called on a file rather than a directory, two cache entries would be added (the file + its parent) but only one of them would get pinned if the caller then called Pin(f). This left the other one exposed to expiration if the ci.FsCacheExpireDuration was reached. This was problematic because both entries point to the same Fs, and if one entry expires while the other is pinned, the Shutdown method gets erroneously called on an Fs that is still in use.

An example of the problem showed up in the Hasher backend, which uses the Shutdown method to stop the bolt db used to store hashes. If a command was run on a Hasher file (ex. rclone md5sum --download hasher:somelargefile.zip) and hashing the file took longer than the --fs-cache-expire-duration (5m by default), the bolt db was stopped before the hashing operation completed, resulting in an error.

This change fixes the issue by ensuring that the Pin/Unpin functions also Pin/Unpin any aliases in lockstep, so that pinning a file also pins the cache entry for its parent.

Was the change discussed in an issue or in the forum before?

Checklist

  • I have read the contribution guidelines.
  • I have added tests for all changes in this PR if appropriate.
  • I have added documentation for the changes if appropriate.
  • All commit messages are in house style.
  • I'm done, this Pull Request is ready for review :-)

Before this change, Hasher would sometimes try to stop a bolt db that was
already stopped, resulting in an error. This change fixes the issue by checking
first whether the db is already stopped.

https://forum.rclone.org/t/hasher-with-gdrive-backend-does-not-return-sha1-sha256-for-old-files/44680/11?u=nielash
Before this change, Hasher did not check whether a "passed hash" (hashtype
natively supported by the wrapped backend) returned from a backend was blank,
and would sometimes return a blank hash to the caller even when a non-blank hash
was already stored in the db. This caused issues with, for example, Google
Drive, which has SHA1 / SHA256 hashes for some files but not others
(https://rclone.org/drive/#sha1-or-sha256-hashes-may-be-missing) and sometimes also
does not have hashes for very recently modified files.

After this change, Hasher will check if the received "passed hash" is
unexpectedly blank, and if so, it will continue to try other enabled methods,
such as retrieving a value from the database, or possibly regenerating it.

https://forum.rclone.org/t/hasher-with-gdrive-backend-does-not-return-sha1-sha256-for-old-files/44680/9?u=nielash
@nielash nielash marked this pull request as ready for review March 8, 2024 22:34
@ncw ncw (Member) left a comment

Before this change, when cache.GetFn was called on a file rather than a
directory, two cache entries would be added (the file + its parent)

I wonder if this is the actual bug - adding two entries.

The thinking behind caching two entries was that the backend instance returned when you call NewFs("remote:dir") and NewFs("remote:dir/file.txt") is the same.

However this is not true - the result of the Root() call is different for the two backends, so they should be treated as two backends not one.

So I think a better fix for this might be to remove the caching of the parent in the pointing to a file case.

What do you think?

PS I've cherry-picked the two hasher fixes for the release. This fix can go in the release if done in time or in a point release.

fs/cache/cache.go: review comment (outdated, resolved)
@nielash nielash (Collaborator, Author) commented Mar 9, 2024

So I think a better fix for this might be to remove the caching of the parent in the pointing to a file case.

What do you think?

I agree! Adding two entries never totally made sense to me.

@ncw ncw (Member) commented Mar 9, 2024

I agree! Adding two entries never totally made sense to me.

Do you want to have a go at that change? Then you can see if it fixes your hasher oddities.

@nielash nielash (Collaborator, Author) commented Mar 9, 2024

Do you want to have a go at that change? Then you can see if it fixes your hasher oddities.

Sure! Looking at it now.

@nielash nielash (Collaborator, Author) commented Mar 9, 2024

I am realizing a slight problem here, which is that Pin(f) relies on fs.ConfigString(f) to get the name, and fs.ConfigString(f) will return the same string for both remote:path and remote:path/file.txt. So if the only cache entry is remote:path/file.txt, Pin(f) will never find it. What do you think is the proper solution to that?

@nielash nielash (Collaborator, Author) commented Mar 9, 2024

the result of the Root() call is different for the two backends

Also, it looks to me like this used to be true, but was changed very recently in c69eb84

As a result, if all you have is f it seems like there's no way of determining which file it should point to...

@nielash nielash (Collaborator, Author) commented Mar 10, 2024

Here's an alternate fix -- see what you think. 6c808e6

It eliminates the duplicate entry, but the one it keeps is the parent rather than the file. This is because of what I mentioned above -- that after c69eb84 the f.Root() for remote:path and remote:path/file.txt are indistinguishable, so you can deduce a parent from a child but not the reverse.

This version should ensure that:

  1. only one entry is added to the cache (the file's parent, not the file).
  2. future lookups correctly find the entry regardless of whether they are called with the parent name or one of its children.
  3. fs.ErrorIsFile is returned when (and only when) fsString points to a file (preserving the fix from 8d5bc7f).

If we wanted to do the reverse and cache the file but not its parent, I think what we'd probably have to do is either revert c69eb84 and have f.Root() return a different path, or add a new required method to fs.Fs which returns the file remote path or "" if not a file. Additionally we'd have to double-check that fs.ConfigString(f) returns something different for files vs. dirs. Could potentially be useful for other reasons, but I'm not sure it's necessary here, as there's a less invasive workable solution.

@ncw ncw (Member) commented Mar 13, 2024

Here's an alternate fix -- see what you think. 6c808e6

Wow, that is quite complicated! Probably too complicated for a point release - I'm really allergic to breaking stuff in a point release!

It eliminates the duplicate entry, but the one it keeps is the parent rather than the file. This is because of what I mentioned above -- that after c69eb84 the f.Root() for remote:path and remote:path/file.txt are indistinguishable, so you can deduce a parent from a child but not the reverse.

Note that commit is just one of many fixing the same problem in lots of backends!

This version should ensure that:

  1. only one entry is added to the cache (the file's parent, not the file).
  2. future lookups correctly find the entry regardless of whether they are called with the parent name or one of its children.
  3. fs.ErrorIsFile is returned when (and only when) fsString points to a file (preserving the fix from 8d5bc7f).

I think those seem correct.

Can we do it more simply?

One thing I was thinking is that if we changed the contents of the cache from fs.Fs to something like

struct {
    f fs.Fs
    files []string
}

Then when inserting a ErrorIsFile we'd just insert the parent but add an entry to files, and when looking up we'd look for the fsString as-is then look it up without the last path entry. If a member of files matches then we could return the f with ErrorIsFile.

If we wanted to do the reverse and cache the file but not its parent, I think what we'd probably have to do is either revert c69eb84 and have f.Root() return a different path, or add a new required method to fs.Fs which returns the file remote path or "" if not a file. Additionally we'd have to double-check that fs.ConfigString(f) returns something different for files vs. dirs. Could potentially be useful for other reasons, but I'm not sure it's necessary here, as there's a less invasive workable solution.

I don't want to undo c69eb84 as it fixed a small but important data loss bug.

The whole returning fs.ErrorIsFile is not very pleasant, but I haven't thought of anything better! One idea I had was that you always open a backend at the root with NewFs then you Chroot into the directory that you want, but that isn't much better.

@nielash nielash (Collaborator, Author) commented Mar 13, 2024

I'm really allergic to breaking stuff in a point release!

I would argue that it is already broken and this patch un-breaks it 😄 But I hear you!

It seems plausible that the same issue could be breaking copyto and moveto operations that take longer than 5 minutes, for backends with a Shutdown method. (I haven't tested this)

Can we do it more simply?

My first version (c4ac032) is still an option. It's much simpler. (Leave the duplicate but Pin/Unpin them in lockstep)

One thing I was thinking is that if we changed the contents of the cache from fs.Fs to something like

struct {
    f fs.Fs
    files []string
}

Then when inserting a ErrorIsFile we'd just insert the parent but add an entry to files, and when looking up we'd look for the fsString as-is then look it up without the last path entry. If a member of files matches then we could return the f with ErrorIsFile.

That's very similar to what my newest version is doing. I think it would work -- but it might mean having to call NewFs more often, because we'd lose all the files []string knowledge every time the cache expires.

I don't want to undo c69eb84 as it fixed a small but important data loss bug.

I don't think we need to.

The whole returning fs.ErrorIsFile is not very pleasant, but I haven't thought of anything better! One idea I had was that you always open a backend at the root with NewFs then you Chroot into the directory that you want, but that isn't much better.

Yeah I don't have any perfect solutions either. One idea might be to have NewFs and cache.Get return an optional remote string param like cmd.NewFsFile, which can be "" if not pointing to a file. (Maybe also an fs.Directory to help with #7652)

My overall observation as someone still learning all of this is that paths are much easier to reason about on the CLI than in the code -- f.Root() does not always mean what I'd expect!
