New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
En 6839 dump goroutines #2009
En 6839 dump goroutines #2009
Conversation
…9-dump-goroutines # Conflicts: # consensus/spos/bls/blsSubroundsFactory.go # consensus/spos/bls/blsSubroundsFactory_test.go # consensus/spos/errors.go # consensus/spos/sposFactory/sposFactory.go # consensus/spos/sposFactory/sposFactory_test.go # node/node.go
consensus/chronology/chronology.go
Outdated
|
||
cb := func(alarmID string) { | ||
buffer := new(bytes.Buffer) | ||
err := pprof.Lookup("goroutine").WriteTo(buffer, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can also print the stack trace for each go routine, if needed by using
buff := make([]byte, 1<<27)
num := runtime.Stack(buff, true)
dump := string(buff[:num])
This one looks exactly like the one we got when the code panics
…9-dump-goroutines # Conflicts: # cmd/node/main.go
|
||
as.event <- evt | ||
|
||
as.Add(callback, duration, alarmID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we call Add here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the alarm was canceled before, so it needs to be readded.
@@ -687,3 +687,15 @@ func WithPeerHonestyHandler(peerHonestyHandler consensus.PeerHonestyHandler) Opt | |||
return nil | |||
} | |||
} | |||
|
|||
// WithWatchdogTimer sets up a watchdog for the Node | |||
func WithWatchdogTimer(watchdog core.WatchdogTimer) Option { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add test for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
consensus/chronology/chronology.go
Outdated
@@ -151,6 +165,8 @@ func (chr *chronology) startRound() { | |||
return | |||
} | |||
|
|||
chr.watchdog.Reset(chronologyAlarmID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move the reset at the beginning of updateRound() maybe where there is a change of round, otherwise for example in BON where we start with negative rounds, there will be a lot of watchdog expiries before round 0 is reached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
core/watchdog/watchdog.go
Outdated
|
||
// SetDefault sets the default alarm with the specified duration. | ||
// When the default alarm expires, the goroutines stack traces will be logged, and the node will gracefully close. | ||
func (w *watchdog) SetDefault(duration time.Duration, alarmID string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change naming alarmID to watchdogID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
core/watchdog/watchdog.go
Outdated
// SetDefault sets the default alarm with the specified duration. | ||
// When the default alarm expires, the goroutines stack traces will be logged, and the node will gracefully close. | ||
func (w *watchdog) SetDefault(duration time.Duration, alarmID string) { | ||
cb := func(alarmID string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you declare the function outside e.g func defaultWatchdogExpiry() and only set it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
System tests passed.
If consensus/chronology component is stuck, log the goroutines stack dump, and then gracefully close the node. Kill the node process only if the gracefully close could not complete.