En 6839 dump goroutines #2009

Merged
merged 11 commits into from Jun 25, 2020

Conversation

BeniaminDrasovean
Contributor

If the consensus/chronology component is stuck, log a stack dump of all goroutines and then gracefully close the node. Kill the node process only if the graceful close cannot complete.

…9-dump-goroutines

# Conflicts:
#	consensus/spos/bls/blsSubroundsFactory.go
#	consensus/spos/bls/blsSubroundsFactory_test.go
#	consensus/spos/errors.go
#	consensus/spos/sposFactory/sposFactory.go
#	consensus/spos/sposFactory/sposFactory_test.go
#	node/node.go

cb := func(alarmID string) {
	buffer := new(bytes.Buffer)
	err := pprof.Lookup("goroutine").WriteTo(buffer, 1)

Contributor

We can also print the stack trace for each goroutine, if needed, by using:

    buff := make([]byte, 1<<27)
    num := runtime.Stack(buff, true)
    dump := string(buff[:num])

This one looks exactly like the one we get when the code panics.
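
For comparison, here is a small self-contained sketch of both approaches side by side. The function names `dumpWithPprof` and `dumpWithRuntimeStack` are made up for illustration; the calls themselves are the standard `runtime/pprof` and `runtime` ones quoted above.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// dumpWithPprof returns the goroutine profile in the text format that
// debug level 1 produces, where identical stacks are grouped with a count.
func dumpWithPprof() (string, error) {
	buffer := new(bytes.Buffer)
	err := pprof.Lookup("goroutine").WriteTo(buffer, 1)
	if err != nil {
		return "", err
	}
	return buffer.String(), nil
}

// dumpWithRuntimeStack returns one stack trace per goroutine, in the
// layout the runtime prints on an unrecovered panic. The output is
// truncated if it does not fit in maxBytes.
func dumpWithRuntimeStack(maxBytes int) string {
	buff := make([]byte, maxBytes)
	num := runtime.Stack(buff, true) // true: include all goroutines
	return string(buff[:num])
}

func main() {
	profile, err := dumpWithPprof()
	if err != nil {
		fmt.Println("pprof dump failed:", err)
		return
	}
	fmt.Println(profile)
	fmt.Println(dumpWithRuntimeStack(1 << 20))
}
```

Passing `2` instead of `1` as the debug argument to `WriteTo` also yields the panic-style layout, without having to choose a buffer size up front.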

as.event <- evt

as.Add(callback, duration, alarmID)

Contributor

Why do we call Add here?

Contributor Author

Because the alarm was canceled before, so it needs to be re-added.

@@ -687,3 +687,15 @@ func WithPeerHonestyHandler(peerHonestyHandler consensus.PeerHonestyHandler) Opt
		return nil
	}
}

// WithWatchdogTimer sets up a watchdog for the Node
func WithWatchdogTimer(watchdog core.WatchdogTimer) Option {

Contributor

Add a test for this.

Contributor Author

Done.
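
A test for an option like this usually covers the nil case and the happy path. The following is only a sketch of that idea: `Node`, `Option`, `WatchdogTimer`, `errNilWatchdog` and `stubWatchdog` are simplified, hypothetical stand-ins for the real types, and the nil check is an assumption about what the option validates.

```go
package main

import (
	"errors"
	"fmt"
)

// WatchdogTimer is a minimal stand-in for core.WatchdogTimer.
type WatchdogTimer interface {
	Reset(alarmID string)
}

// Node and Option are simplified stand-ins for the node package types.
type Node struct {
	watchdog WatchdogTimer
}

type Option func(*Node) error

// errNilWatchdog is a hypothetical error for this sketch.
var errNilWatchdog = errors.New("nil watchdog timer")

// WithWatchdogTimer sets up a watchdog for the Node.
func WithWatchdogTimer(watchdog WatchdogTimer) Option {
	return func(n *Node) error {
		if watchdog == nil {
			return errNilWatchdog
		}
		n.watchdog = watchdog
		return nil
	}
}

// stubWatchdog is a test double that does nothing.
type stubWatchdog struct{}

func (stubWatchdog) Reset(string) {}

func main() {
	n := &Node{}
	fmt.Println("nil watchdog:", WithWatchdogTimer(nil)(n))
	fmt.Println("stub watchdog:", WithWatchdogTimer(stubWatchdog{})(n))
}
```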

@@ -151,6 +165,8 @@ func (chr *chronology) startRound() {
		return
	}

	chr.watchdog.Reset(chronologyAlarmID)

Contributor

Can you move the reset to the beginning of updateRound(), perhaps where the round changes? Otherwise, for example in BON, where we start with negative rounds, there will be a lot of watchdog expiries before round 0 is reached.

Contributor Author

Done.
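
The idea in this thread, resetting the watchdog whenever the round index changes (including while rounds are still negative), can be sketched as follows. `roundWatcher` is a hypothetical helper for illustration only; in the real code the reset would presumably live inside `updateRound()`.

```go
package main

import "fmt"

// roundWatcher is a hypothetical helper, not a type from this PR.
type roundWatcher struct {
	lastIndex int64
	onChange  func() // e.g. a closure that resets the watchdog
}

// update calls onChange only when the round index differs from the
// last one seen, so negative rounds also keep the watchdog alive.
func (rw *roundWatcher) update(currentIndex int64) {
	if currentIndex == rw.lastIndex {
		return
	}
	rw.lastIndex = currentIndex
	rw.onChange()
}

func main() {
	resets := 0
	rw := &roundWatcher{lastIndex: -3, onChange: func() { resets++ }}
	for _, index := range []int64{-3, -2, -2, -1, 0, 0} {
		rw.update(index)
	}
	fmt.Println("watchdog resets:", resets) // prints "watchdog resets: 3"
}
```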


// SetDefault sets the default alarm with the specified duration.
// When the default alarm expires, the goroutines stack traces will be logged, and the node will gracefully close.
func (w *watchdog) SetDefault(duration time.Duration, alarmID string) {

Contributor

Rename alarmID to watchdogID.

Contributor Author

Done.

// SetDefault sets the default alarm with the specified duration.
// When the default alarm expires, the goroutines stack traces will be logged, and the node will gracefully close.
func (w *watchdog) SetDefault(duration time.Duration, alarmID string) {
	cb := func(alarmID string) {

Contributor

Can you declare the function outside, e.g. func defaultWatchdogExpiry(), and only set it here?

Contributor Author

Done.

Contributor

@LucianMincu LucianMincu left a comment

System tests passed.

@LucianMincu LucianMincu merged commit 60fb2d5 into development Jun 25, 2020
@LucianMincu LucianMincu deleted the EN-6839-dump-goroutines branch June 25, 2020 16:31

4 participants