Files will now close if held open too long #239

Merged
merged 1 commit into main from fs-changes on Sep 3, 2023
Conversation

@allada (Member) commented on Aug 26, 2023

In testing Buck2, it turns out it hammers the scheduler with jobs and assumes the scheduler can keep up (which is good). In local testing, where all services run on the same machine, this was deadlocking because it was performing the following operations:

  1. Open file1 for reading
  2. Open file2 for writing
  3. Stream file1 -> file2

Since we allow users to limit the number of open files at any given time, file1 was held open while waiting for file2 to open, and file2 could not open until some other file was closed. With buck2 issuing so many jobs at once, the open-file slots ended up held by readers waiting on writers, and everything deadlocked.
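For illustration only, here is a minimal sketch of that failure mode, assuming the open-file limit behaves like a semaphore-style budget (the semaphore, the budget of 1, and the names below are assumptions for the demo, not the project's actual implementation):

```rust
// Hypothetical model of the deadlock: one task holds the only open-file
// permit (for file1) and then waits for a second permit (for file2).
// Requires the `tokio` crate with the "macros", "rt" and "sync" features.
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Pretend max_open_files = 1 for the demonstration.
    let open_files = Semaphore::new(1);

    // Permit taken to open file1 for reading, held while streaming...
    let _read_permit = open_files.acquire().await.unwrap();

    // ...so this wait for a permit to open file2 for writing never
    // completes: the only permit is still held just above.
    let _write_permit = open_files.acquire().await.unwrap();
    println!("never reached with a budget of 1");
}
```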

In most production systems this is not an issue because the CAS is separated from the workers, but it could occasionally happen on the workers if `max_open_files` is set too low.

To get around this issue, `ResumeableFileSlot` is introduced. It lets callers apply a timeout and call `.close_file()` on the slot; the next time the struct is used, it will re-open the file.
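A rough sketch of the idea (the struct and method names below are illustrative assumptions, not the crate's actual API; the real `ResumeableFileSlot` is async and timeout-driven):

```rust
// Illustrative, synchronous approximation of a "resumable" file slot:
// the path and read offset are kept so the OS handle can be dropped to
// free an open-file slot and re-opened transparently on next use.
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::PathBuf;

struct ResumableFile {
    path: PathBuf,
    offset: u64,          // where to resume after a re-open
    handle: Option<File>, // None = handle released
}

impl ResumableFile {
    fn new(path: PathBuf) -> Self {
        Self { path, offset: 0, handle: None }
    }

    /// Drop the OS handle so it no longer counts against max_open_files.
    fn close_file(&mut self) {
        self.handle = None;
    }

    /// Re-open the file (if needed) and seek back to the saved offset.
    fn file(&mut self) -> io::Result<&mut File> {
        if self.handle.is_none() {
            let mut f = File::open(&self.path)?;
            f.seek(SeekFrom::Start(self.offset))?;
            self.handle = Some(f);
        }
        Ok(self.handle.as_mut().expect("handle was just set"))
    }

    /// Read a chunk, tracking the offset so a later re-open can resume.
    fn read_chunk(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.file()?.read(buf)?;
        self.offset += n as u64;
        Ok(n)
    }
}
```

In the file1 -> file2 scenario, the streaming loop could call `close_file()` on the read slot whenever opening the destination exceeds the timeout, so the reader stops counting against the limit, then resume reading once the write side is ready.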

related #222
closes #238



@allada (Member, Author) left a comment


+@chrisstaite-menlo

FYI, I'll be on vacation for the next week, so there will be no follow-ups from me during that time; no rush.

Reviewable status: 0 of 20 files reviewed, all discussions resolved (waiting on @chrisstaite-menlo)

@chrisstaite-menlo (Collaborator) left a comment


:lgtm:

Reviewed 20 of 20 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @allada)

@allada merged commit 67b90e2 into main on Sep 3, 2023 (6 checks passed)
@allada deleted the fs-changes branch on September 3, 2023 at 16:31