Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dkg: sync privkeylock before exiting #2257

Merged
merged 9 commits into from
Jun 1, 2023

Conversation

gsora
Copy link
Collaborator

@gsora gsora commented May 31, 2023

Block dkg exit until privkeylock.Run() enclosing goroutine exits, ensuring that the private key lock file is always deleted.

Run() is supposed to be executed in a goroutine, and since we can't control the scheduler it might be scheduled after the main one exits, leading to the ctx.Done() code path to sometimes never be executed.

category: bug
ticket: #2258

gsora added 2 commits May 31, 2023 10:55
… deletion

Add `Done()` method that blocks until the `Run()` method finishes its course.

`Run()` is supposed to be executed in a goroutine, and since we can't control the scheduler it might be scheduled *after* the `main` one exits, leading to the `ctx.Done()` code path to never be executed.

Add `Done()`, caller can block on untile `Run()` finishes its execution.

Callers must cancel the `Run()` context, then call `Done()`.
Also modifies the DKG run test to check that all privkey locks are deleted.
@codecov
Copy link

codecov bot commented May 31, 2023

Codecov Report

Patch coverage: 62.50% and project coverage change: +0.03 🎉

Comparison is base (f59769f) 53.83% compared to head (8fb0237) 53.86%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2257      +/-   ##
==========================================
+ Coverage   53.83%   53.86%   +0.03%     
==========================================
  Files         188      188              
  Lines       25513    25524      +11     
==========================================
+ Hits        13734    13749      +15     
  Misses      10092    10092              
+ Partials     1687     1683       -4     
Impacted Files Coverage Δ
app/app.go 6.96% <0.00%> (-0.02%) ⬇️
dkg/dkg.go 57.60% <50.00%> (+0.17%) ⬆️
app/privkeylock/privkeylock.go 60.34% <83.33%> (+5.44%) ⬆️

... and 11 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

dkg/dkg.go Outdated Show resolved Hide resolved
@gsora gsora requested a review from dB2510 May 31, 2023 09:20
dkg/dkg.go Outdated Show resolved Hide resolved
Co-authored-by: Abhishek Kumar <43061995+xenowits@users.noreply.github.com>
@gsora gsora requested a review from xenowits May 31, 2023 11:27
dkg/dkg.go Outdated Show resolved Hide resolved
Co-authored-by: Abhishek Kumar <43061995+xenowits@users.noreply.github.com>
@gsora gsora requested a review from xenowits May 31, 2023 11:53
@xenowits
Copy link
Contributor

@gsora hear me out on this:

  • Instead of using a sync.WaitGroup, let's just add a quit channel:
// Service is a private key locking service.
type Service struct {
         ...
	quit            chan struct{}   // New
}
  • In the Stop() method, we simply close the quit chan:
func (s *Service) Done() {
	close(s.quit)
}
  • We add another case in the select block:
case <-ctx.Done():
	cleanup = true
case <-h.quit:
	cleanup = true
  • Break out of the loop if cleanup == true
  • Finally, do the delete file stuff

It essentially does the same thing as the waitgroup but the quit chan pattern is very common in the charon codebase (see scheduler, aggsigdb etc). The DKG process also doesn't have to wait for privkeylock deletion to complete, since it will eventually happen.

Wdyt?

@gsora
Copy link
Collaborator Author

gsora commented May 31, 2023

@xenowits

The DKG process also doesn't have to wait for privkeylock deletion to complete, since it will eventually happen.

That is not correct: since the deletion happens in a different goroutine you are not sure if it'll be scheduled.

I just implemented your solution as a test, and while it does work more often than just relying on defer cancel(), it sometimes doesn't work.

The idea here is having the main goroutine waiting until the child one has actually deleted the file.

@gsora
Copy link
Collaborator Author

gsora commented May 31, 2023

I did implement a channel-based solution in a call with @dB2510: it worked fine.

I fell back on sync.WorkGroup because I wanted to not think about side-effects and rely on a piece of stdlib that has been build to do exactly this: synchronize different goroutines.

A thing we could do to make it more obvious is spawning the background goroutine inside privkeylock.Run() instead of leaving the caller to do that.

@gsora
Copy link
Collaborator Author

gsora commented May 31, 2023

Actually thinking about loud, maybe this fix shouldn't be in privkeylock, rather it must just concern dkg.

Implementing a different solution as we speak.

…dn't have

dkg has the concurrency problem, so make it responsible for the fix as well.
@gsora gsora changed the title app/privkeylock: sync callers before delete dkg: sync privkeylock before exiting May 31, 2023
dkg/dkg.go Outdated Show resolved Hide resolved
dkg/dkg.go Outdated Show resolved Hide resolved
dkg/dkg.go Outdated Show resolved Hide resolved
dkg/dkg.go Outdated Show resolved Hide resolved
Co-authored-by: Abhishek Kumar <43061995+xenowits@users.noreply.github.com>
Copy link
Contributor

@xenowits xenowits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I think this is a more simpler approach 👍

}()

// Stop it on exit.
defer lockSvc.Close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move this line before go func, after we initialise lockSvc

go func(ctx context.Context) {
if err := lockSvc.Run(ctx); err != nil {
log.Error(ctx, "Error locking private key file", err)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@corverroos why do we need a separate block for this?

}
}(ctx)

// Start it async
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Start it async
// Start private key lock service async.

@xenowits xenowits added the merge when ready Indicates bulldozer bot may merge when all checks pass label Jun 1, 2023
@obol-bulldozer obol-bulldozer bot merged commit 881227e into main Jun 1, 2023
12 checks passed
@obol-bulldozer obol-bulldozer bot deleted the gsora/privkeylock_sync_deletion branch June 1, 2023 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge when ready Indicates bulldozer bot may merge when all checks pass
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants