Merge pull request juju#13084 from manadart/2.8-uniter-graceful-lock-cancellation

juju#13084

The linked bug describes a scenario where:
- At least one hook execution is mid-flight.
- The executor is waiting to acquire a machine lock.
- A model migration is commenced, causing the migration-inactive flag to drop.
- The lock acquisition is cancelled, which is represented as an error that flows up and out of the resolver loop.
- The unit briefly enters an error state.
- This error state causes the migration validation to fail.

Here, we change the resolver loop so that if the cause of the error returned by `executor.Run` is `mutex.ErrCancelled`, we return `ErrRestart`. This is the safest thing to do in the general case, and in this specific case it yields the same behaviour as translating a fortress lockdown (see `TranslateFortressErrors`).

This should mean a clean shutdown, successful validation (the initial migration phase), and then migration progression.
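
The check works even though the executor annotates the cancellation error before returning it, because `errors.Cause` recovers the underlying sentinel. A minimal sketch of that behaviour (the operation and unit names here are illustrative):

```go
package main

import (
	"fmt"

	"github.com/juju/errors"
	"github.com/juju/mutex"
)

func main() {
	// The executor wraps the cancellation error with context before
	// returning it up the stack, as in operation/executor.go.
	err := errors.Annotatef(mutex.ErrCancelled,
		"acquiring %q lock for %s", "run-hook", "telegraf/4")

	// errors.Cause unwraps the annotation, so the resolver loop can
	// still match the sentinel value and translate it to ErrRestart.
	fmt.Println(errors.Cause(err) == mutex.ErrCancelled) // true
}
```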

## QA steps

- Bootstrap 2 controllers.
- On the first, deploy several applications to a single machine.
- I used `juju run <unit> "sleep 10"` on multiple units at once.
- Concurrently, migrate the model to the second controller.
- We want to see in the logs that the lock acquisition is cancelled, but that migration progresses:
```
unit-mongodb-0: 15:58:14 INFO juju.worker.migrationminion migration phase is now: QUIESCE
unit-telegraf-4: 15:58:14 INFO juju.worker.migrationminion migration phase is now: QUIESCE
unit-cassandra-0: 15:58:14 INFO juju.worker.logger logger worker stopped
unit-telegraf-4: 15:58:14 INFO juju.worker.logger logger worker stopped
unit-telegraf-4: 15:58:14 WARNING juju.worker.uniter.resolver executor lock acquisition cancelled <-- HERE
unit-ubuntu-0: 15:58:14 INFO juju.worker.migrationminion migration phase is now: QUIESCE
unit-postgresql-0: 15:58:14 INFO juju.worker.migrationminion migration phase is now: QUIESCE
unit-cassandra-0: 15:58:14 INFO juju.worker.migrationminion migration phase is now: QUIESCE
```
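
For reference, one possible shape for those steps as shell commands (controller, model, and application names are illustrative, and exact flags may vary between Juju 2.x releases):

```sh
# Two controllers on the same cloud; "src" and "dst" are arbitrary names.
juju bootstrap lxd src
juju bootstrap lxd dst

# Several applications co-located on machine 0 of a model on "src".
juju switch src
juju add-model demo
juju deploy ubuntu app1
juju deploy ubuntu app2 --to 0

# Keep hook executors queued on the machine lock while migrating.
juju run --unit app1/0 "sleep 10" &
juju run --unit app2/0 "sleep 10" &
juju migrate demo dst
```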

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1918680
jujubot committed Jun 17, 2021
2 parents bf308d1 + bec7c38 commit 98e51af
Showing 4 changed files with 45 additions and 5 deletions.
2 changes: 1 addition & 1 deletion core/machinelock/machinelock.go
```diff
@@ -131,7 +131,7 @@ func (s Spec) Validate() error {
 		}
 	}
 	if s.Worker == "" {
-		return errors.NotValidf("mssing Worker")
+		return errors.NotValidf("missing Worker")
 	}
 	return nil
 }
```
2 changes: 1 addition & 1 deletion worker/uniter/operation/executor.go
```diff
@@ -87,7 +87,7 @@ func (x *executor) Run(op Operation, remoteStateChange <-chan remotestate.Snapsh
 	if op.NeedsGlobalMachineLock() {
 		releaser, err := x.acquireMachineLock(op.String())
 		if err != nil {
-			return errors.Annotatef(err, "could not acquire %q lock for %s", op, x.unitName)
+			return errors.Annotatef(err, "acquiring %q lock for %s", op, x.unitName)
 		}
 		defer x.logger.Debugf("lock released for %s", x.unitName)
 		defer releaser()
```
15 changes: 15 additions & 0 deletions worker/uniter/resolver/loop.go
```diff
@@ -6,6 +6,7 @@ package resolver
 import (
 	"github.com/juju/charm/v7/hooks"
 	"github.com/juju/errors"
+	"github.com/juju/mutex"
 
 	"github.com/juju/juju/core/lxdprofile"
 	"github.com/juju/juju/worker/fortress"
@@ -20,6 +21,7 @@ var logger interface{}
 // Logger represents the logging methods used in this package.
 type Logger interface {
 	Errorf(string, ...interface{})
+	Warningf(string, ...interface{})
 	Debugf(string, ...interface{})
 	Tracef(string, ...interface{})
 }
@@ -114,6 +116,19 @@ func Loop(cfg LoopConfig, localState *LocalState) error {
 			cfg.Logger.Tracef("running op: %v", op)
 			if err := cfg.Executor.Run(op, remoteStateChanged); err != nil {
 				close(done)
+
+				if errors.Cause(err) == mutex.ErrCancelled {
+					// If the lock acquisition was cancelled (such as when the
+					// migration-inactive flag drops), we do not want the
+					// resolver to surface that error, since it would put the
+					// agent into the "failed" state and cause the initial
+					// migration validation phase to fail.
+					// The safest thing to do is to bounce the loop and
+					// reevaluate our state, which is what happens upon a
+					// fortress error anyway (uniter.TranslateFortressErrors).
+					cfg.Logger.Warningf("executor lock acquisition cancelled")
+					return ErrRestart
+				}
 				return errors.Trace(err)
 			}
 			close(done)
```
31 changes: 28 additions & 3 deletions worker/uniter/resolver/loop_test.go
```diff
@@ -10,7 +10,7 @@ import (
 	"github.com/juju/charm/v7"
 	"github.com/juju/charm/v7/hooks"
 	"github.com/juju/loggo"
-
+	"github.com/juju/mutex"
 	envtesting "github.com/juju/testing"
 	jc "github.com/juju/testing/checkers"
 	gc "gopkg.in/check.v1"
@@ -259,7 +259,7 @@ func (s *LoopSuite) TestLoopWithChange(c *gc.C) {
 }
 
 func (s *LoopSuite) TestRunFails(c *gc.C) {
-	s.executor.SetErrors(errors.New("Run fails"))
+	s.executor.SetErrors(errors.New("run fails"))
 	s.resolver = resolver.ResolverFunc(func(
 		_ resolver.LocalState,
 		_ remotestate.Snapshot,
@@ -268,7 +268,7 @@ func (s *LoopSuite) TestRunFails(c *gc.C) {
 		return mockOp{}, nil
 	})
 	_, err := s.loop()
-	c.Assert(err, gc.ErrorMatches, "Run fails")
+	c.Assert(err, gc.ErrorMatches, "run fails")
 }
 
 func (s *LoopSuite) TestNextOpFails(c *gc.C) {
@@ -412,6 +412,31 @@ func (s *LoopSuite) testCheckCharmUpgradeCallsRun(c *gc.C) {
 	s.executor.CheckCallNames(c, "State", "State", "Run", "State")
 }
 
+func (s *LoopSuite) TestCancelledLockAcquisitionCausesRestart(c *gc.C) {
+	s.executor = &mockOpExecutor{
+		Executor: nil,
+		Stub:     envtesting.Stub{},
+		st: operation.State{
+			Started: true,
+			Kind:    operation.Continue,
+		},
+		run: func(operation.Operation, <-chan remotestate.Snapshot) error {
+			return mutex.ErrCancelled
+		},
+	}
+
+	s.resolver = resolver.ResolverFunc(func(
+		_ resolver.LocalState,
+		_ remotestate.Snapshot,
+		_ operation.Factory,
+	) (operation.Operation, error) {
+		return &mockOp{}, nil
+	})
+
+	_, err := s.loop()
+	c.Assert(err, gc.Equals, resolver.ErrRestart)
+}
+
 func waitChannel(c *gc.C, ch <-chan interface{}, activity string) interface{} {
 	select {
 	case v := <-ch:
```
