
Cannot fetch pytorch repo as it can't acquire lock for reference update #595

Closed
Byron opened this issue Nov 15, 2022 · 4 comments

Comments

@Byron
Owner

Byron commented Nov 15, 2022

Reproduction

git clone https://github.com/pytorch/pytorch
cd pytorch
gix -v fetch

The above results in the following error:

❯ gix -v fetch
 14:54:16 read pack done 184.6KB in 0.64s (288.0KB/s)
 14:54:16  indexing 282 objects were resolved into 179 objects during thin-pack resolution
 14:54:16  indexing done 179.0 objects in 0.53s (340.0 objects/s)
 14:54:16 decompressing done 3.5MB in 0.53s (6.7MB/s)
 14:54:16     Resolving done 179.0 objects in 0.06s (3.2k objects/s)
 14:54:16      Decoding done 6.6MB in 0.06s (119.2MB/s)
 14:54:16 writing index file done 6.1KB in 0.00s (25.1MB/s)
 14:54:16  create index file done 179.0 objects in 0.58s (307.0 objects/s)
Error: Failed to update references to their new position to match their remote locations

Caused by:
    0: A lock could not be obtained for reference "refs/remotes/origin/update-test-distribute"
    1: The lock for resource './.git/refs/remotes/origin/update-test-distribute could not be obtained after 0.10s after 8 attempt(s). The lockfile at './.git/refs/remotes/origin/update-test-distribute.lock' might need manual deletion.
@Byron Byron mentioned this issue Nov 15, 2022
@Byron
Owner Author

Byron commented Nov 15, 2022

It turns out that despite best attempts to release tempfiles right after using them, this doesn't actually happen here, which leads to the system exhausting its file handles. This shows up as AlreadyExists with a custom error string, which is interpreted as the lock already existing (even though it doesn't; it simply cannot be created either).

From all I can tell, the resources should be cleaned up immediately, so it's a bit of a puzzle why that isn't actually happening.
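For context, the failure mode is roughly the following minimal sketch (assumed names and types, not gitoxide's actual code): taking a lock file with `create_new(true)` normally fails with `AlreadyExists` when the lock is contended, but once the process has exhausted its file handles the open fails for an entirely different reason, and lumping both cases together makes exhaustion look like a pre-existing lock.

```rust
use std::{
    fs::{File, OpenOptions},
    io,
    path::Path,
};

/// Sketch only: distinguish "the lock is held by someone else" from
/// "the environment is broken" when taking a lock file.
enum LockError {
    /// The lock file is already present: genuine contention.
    Held(io::Error),
    /// Creation failed for another reason, e.g. EMFILE ("too many open files").
    Other(io::Error),
}

fn try_acquire_lock(lock_path: &Path) -> Result<File, LockError> {
    match OpenOptions::new().write(true).create_new(true).open(lock_path) {
        Ok(file) => Ok(file),
        Err(err) if err.kind() == io::ErrorKind::AlreadyExists => Err(LockError::Held(err)),
        // Everything else must not be reported as "lockfile needs manual deletion".
        Err(err) => Err(LockError::Other(err)),
    }
}
```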

@Byron
Owner Author

Byron commented Nov 15, 2022

Since the error kind is std::io::ErrorKind::Custom, it was worth looking at the tempfile crate, which does create named tempfiles under the hood.

It turns out that it ultimately creates these errors here. When watching it fail permanently, the program actually has no tempfile open anymore (they are held only as long as needed, never more than one at a time), yet this one keeps erroring. Strangely enough, this happens because the system still yields errors.
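For illustration, the underlying pattern looks roughly like this sketch (the helper name and error handling are assumptions, not gitoxide's API): a ref update writes to a tempfile::NamedTempFile and persists it over the target, and each such tempfile keeps a file handle open until it is persisted or dropped, which is why creating very many of them per fetch can run into the handle limit.

```rust
use std::io::Write;
use std::path::Path;
use tempfile::NamedTempFile;

/// Sketch of the usual "write to a named tempfile, then move it into place"
/// pattern behind a loose ref update. Illustrative only.
fn write_ref_atomically(ref_path: &Path, new_value: &str) -> std::io::Result<()> {
    let dir = ref_path.parent().expect("refs live inside a directory");
    let mut tmp = NamedTempFile::new_in(dir)?; // holds an open file handle from here...
    writeln!(tmp, "{new_value}")?;
    tmp.persist(ref_path).map_err(|e| e.error)?; // ...until it is moved into place
    Ok(())
}
```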

@Byron
Owner Author

Byron commented Nov 15, 2022

It took longer than I'd like to admit (due to tooling also being case-sensitive by default) to figure out the following:

❯ gix remote refs | rg -i update-test-distribute
f875027713d1275d0ee58d9762d8b72f8fa78e54 refs/heads/Update-test-distribute
b19b08e16061db5985e7391dbb65711b474fcf94 refs/heads/update-test-distribute

Indeed, there are two refs whose names collide on case-insensitive file systems. Along with the tempfile::NamedTempFile error inviting the wrong conclusions about what really happens, the whole case is more confusing than it should be.

There are two issues I see:

  • fetch creates a lot of lock files and writes them all with the same, unchanged value, costing more than it should since we can prove the value didn't change. It should special-case this to avoid the work; without that, a fetch of pytorch causes nearly 10k such files to be created, written to and moved into place (with no effective change). This is now fixed: they are only created and then removed, without being written to or moved over the existing file.
  • there are two refs whose names differ only by case, and git handles this by automatically pushing them into packed-refs. Git also seems to have some special handling to avoid getting tripped up when creating loose lock files.

The solution here would be to avoid trying to handle colliding ref-names in the same transaction, and of course that should happen transparently. Tests for this are missing, but could easily be added and run on case-insensitive file systems like the ones on macOS or Windows.
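One way to detect such collisions up front could look like this hedged sketch (illustrative only, not the actual transaction code): group the ref names of a transaction by their lowercased form and treat any group with more than one spelling as a collision that must go through packed-refs rather than loose files.

```rust
use std::collections::HashMap;

/// Sketch: return all groups of ref names that differ only by ASCII case.
/// Such refs cannot safely coexist as loose files on case-insensitive filesystems.
fn case_colliding_refs<'a>(ref_names: &[&'a str]) -> Vec<Vec<&'a str>> {
    let mut groups: HashMap<String, Vec<&str>> = HashMap::new();
    for &name in ref_names {
        groups.entry(name.to_ascii_lowercase()).or_default().push(name);
    }
    groups
        .into_values()
        .filter(|group| group.len() > 1)
        .collect()
}

// Example: "refs/heads/Update-test-distribute" and "refs/heads/update-test-distribute"
// end up in the same group and would be routed to packed-refs.
```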

  • 🪳 for clones and fetches, if a packed transaction is present, we can avoid creating per-loose-ref locks at all, as we know everything is locked once packed-refs is locked. Those who want to edit refs have to obtain that lock first. ✅
  • refs in packed-refs are always case-sensitive
  • when editing refs a and A these would end up as loose refs and conflict.
    * when editing them one after another, something invalid would happen and we would create ambiguity. It requires thorough testing to learn how git handles this. Does it really manage to fully work around case-insensitive file systems for refs? The answer is NO: while there is a loose ref with a colliding name, it can't continue, as with gco -b a && gco -b A. So as long as packed-refs transactions avoid creating individual locks when there is no effective change, we should be good (see the sketch after this list).
  • 🪳 another issue to investigate: it doesn't seem to remove empty directories. Ineffectively dropped locks don't clean up after themselves even though they have the information to do so. The problem probably is that there are many nested directories due to a variety of locks, so we would have to remember the directory graph to clean up properly. Can't fix that here, but we can avoid taking the locks in the first place, which brings us back to the first bug. ✅
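To make the "no effective change" case from the list above concrete, here is a minimal sketch with assumed types (not gitoxide's API): if the ref already points at the desired object there is nothing to write, so the lock can be dropped again without writing to it or moving it over the existing file.

```rust
/// Sketch: a 20-byte SHA-1 object id, standing in for the real type.
#[derive(PartialEq, Eq)]
struct ObjectId([u8; 20]);

enum EditOutcome {
    /// Old and new value match: drop the lock again, write nothing.
    NoChange,
    /// The ref actually moved: write the new value and move the lock into place.
    Write(ObjectId),
}

fn plan_ref_edit(current: Option<&ObjectId>, desired: ObjectId) -> EditOutcome {
    match current {
        Some(existing) if *existing == desired => EditOutcome::NoChange,
        _ => EditOutcome::Write(desired),
    }
}
```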

Byron added a commit that referenced this issue Nov 15, 2022
…595)

Instead of moving them into place, we just drop them, without ever
writing into them.
Byron added a commit that referenced this issue Nov 16, 2022
…a pending change. (#595)

As opposed to dropping the Transaction, this method allows one to obtain all edits that would have been applied.
edits that would have been applied.
Byron added a commit that referenced this issue Nov 16, 2022
…and fail as apparently probing really needs the right directory to work
as expected.
Or is there anything else?
Byron added a commit that referenced this issue Nov 16, 2022
Right now they only appear as locks that are already taken, which is not the best UX for sure, but it's a starting point.
Byron added a commit that referenced this issue Nov 16, 2022
…ating the iterator. (#595)

Previously, it would fail on first iteration, making it seem like there
is one reference even though it's just an error stating that the base
cannot be read.

This is clearly worse than making a metadata check on the filesystem,
no matter how unlikely the case.
Byron added a commit that referenced this issue Nov 16, 2022
…ase, correctly yield zero references. (#595)

Previously it reported an error; now it performs no iteration instead, which I believe is more helpful to users of the API as they won't randomly fail just because somebody deleted the `refs` folder.
Byron added a commit that referenced this issue Nov 16, 2022
…ase-insensitive filesystems*. (#595)

The asterisk indicates that this only works if packed-refs are present
and these references are written straight to packed references without
ever trying to handle the otherwise conflicting loose reference files.

This is done by leveraging the fact that in presence of packed-refs
or a pending creation of packed-refs, there is no need to create
per-file locks as concurrent transactions also have to obtain the
packed-refs lock and fail (or wait) until it's done.
Byron added a commit that referenced this issue Nov 16, 2022
…ock is held. (#595)

The logic is similar to what's done with updates.
Byron added a commit that referenced this issue Nov 16, 2022
#595)

Previously it was possible for symbolic refs to be deleted right after
they have been created or updated as they were included in the set of
refs that was assumed to be part of packed-refs, which isn't the case
for symbolic refs.
@Byron
Owner Author

Byron commented Nov 17, 2022

Closing as a fix is available in main. As a positive side-effect, this will also greatly improve performance when cloning or fetching repos with huge amounts of refs.

@Byron Byron closed this as completed Nov 17, 2022