
Update reminder state with CAS instead of Put #193

Merged
merged 1 commit into AcalephStorage:master from bug/race-updating-reminders on Oct 15, 2017

Conversation

Gerrrr
Collaborator

@Gerrrr Gerrrr commented Jul 12, 2017

Problem

After a network failure, we ended up with a hanging reminder in Consul-Alerts. The reminder looked like this:

{"Node":"Aleksandrs-MacBook Pro.local",
 "ServiceId":"",
 "Service":"",
 "CheckId":"serfHealth",
  "Check":"Serf Health Status",
  "Status":"critical",
  "Output":"Agent alive and reachable",
  "Notes":"","Interval":300,
  "RmdCheck":"2017-07-11T08:49:44.378586605+02:00",
  "NotifList":{"email":true},
 ...
}

The health check that initially triggered the reminder was already passing, but the reminder still existed. Note its inconsistent state: the Status is critical, yet according to the Output the agent is alive and reachable.

Solution

The problem was caused by a race condition between consul/client.UpdateCheckData (#L346), which rewrites a reminder whenever the health check's output changes, and check-handler.notify (#L130), which deletes the reminder once the check is passing again. If notify deletes the reminder between UpdateCheckData's read and its Put, the Put blindly recreates the key, leaving a reminder with a stale critical Status but a fresh healthy Output.

To reproduce the race condition, I modified UpdateCheckData so that the update always happens after the reminder is deleted. The relevant part of the function:

if remindermap["Output"] != health.Output {
  log.Printf("Updating reminder data for %s", reminderkey)

  remindermap["Output"] = health.Output
  newreminder, _ := json.Marshal(remindermap)
  go func() {
    time.Sleep(20 * time.Second)
    _, err := kvApi.Put(&consulapi.KVPair{Key: reminderkey, Value: newreminder}, nil)
    if err != nil {
      log.Println("Unable to set kv value: ", err)
    }
  }()
}
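With this change the race is deterministic: by the time the delayed Put fires, check-handler.notify has already deleted the reminder, so the stale value is always written back.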

Then, I started Consul and Nomad in dev mode on the local machine:

 $ consul agent -dev
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
           Version: 'v0.7.5'
           Node ID: 'fb2170a8-257d-3c64-b14d-bc06cc94e34c'
         Node name: 'Aleksandrs-MacBook-Pro.local'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 127.0.0.1 (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
      Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
             Atlas: <disabled>

==> Log data will now stream in as it occurs:
...

$ nomad agent -dev
    No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true
               Version: 0.5.6

==> Nomad agent started! Log data will stream in below:
...

consul-alerts config:

consul-alerts/config/notif-profiles/log_with_reminders

{
  "Interval": 1,
  "NotifList": {
    "log":true,
    "email":false
  }
}

consul-alerts/config/notif-selection/hosts/Aleksandrs-MacBook-Pro.local

log_with_reminders
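
Here log_with_reminders is the notification profile defined above: it enables the log notifier and sets a reminder Interval, and the notif-selection entry attaches that profile to the test host, so reminders are created for failing checks on that node.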

To create a reminder, I blocked port 4646 (Nomad's HTTP port). On OS X I added the following line to the file /private/etc/pf.conf:

block drop quick on lo0 proto tcp from any to any port = 4646

Then I reloaded the configuration:

sudo pfctl -f /etc/pf.conf
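
To confirm the rule was loaded, sudo pfctl -s rules prints the active ruleset (and sudo pfctl -e enables pf if it is not already running).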

After the reminder was created, I unblocked the port. The reminder was first removed by check-handler.notify and then added back, with the updated output, by the code above.

To fix the issue, I used kvApi.CAS instead of kvApi.Put. CAS checks the ModifyIndex of the entry before updating it, so the write fails if the reminder was deleted or modified in the meantime. More information about CAS updates is available in the Consul KV API documentation.
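
A minimal sketch of the CAS-based update, for illustration only: the package and function name here are assumptions, not the actual patch; kvApi, reminderkey and newreminder correspond to the variables in the snippet above, and modifyIndex is assumed to come from the Get that originally read the reminder.

package alerts // illustrative package name

import (
  "log"

  consulapi "github.com/hashicorp/consul/api"
)

// updateReminderOutput sketches the CAS-based write. modifyIndex is the
// ModifyIndex observed when the reminder was read from the KV store.
func updateReminderOutput(kvApi *consulapi.KV, reminderkey string, newreminder []byte, modifyIndex uint64) {
  ok, _, err := kvApi.CAS(&consulapi.KVPair{
    Key:         reminderkey,
    Value:       newreminder,
    ModifyIndex: modifyIndex, // write succeeds only if the index still matches
  }, nil)
  if err != nil {
    log.Println("Unable to set kv value: ", err)
  } else if !ok {
    // The entry was deleted or modified concurrently (e.g. by
    // check-handler.notify), so the stale update is dropped instead
    // of resurrecting the reminder.
    log.Printf("Reminder %s changed concurrently, skipping update", reminderkey)
  }
}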

@Gerrrr Gerrrr force-pushed the bug/race-updating-reminders branch from de77789 to 05ab018 on July 19, 2017 13:13
@Gerrrr
Collaborator Author

Gerrrr commented Oct 7, 2017

@fusiondog Can you please review?

@Gerrrr Gerrrr requested a review from fusiondog October 7, 2017 19:29
@fusiondog fusiondog merged commit bc2a84e into AcalephStorage:master Oct 15, 2017
dagvl added a commit to vimond/consul-alerts that referenced this pull request Jul 5, 2018
* upstream/branch/master: (68 commits)
  End-to-end tests in Travis against Consul 1.0.x (AcalephStorage#226) (AcalephStorage#228)
  Fix typo in README
  Notify Mattermost via Incoming Webhooks (AcalephStorage#206)
  Test in travis against multiple versions of Consul (AcalephStorage#202)
  update reminder state with CAS instead of Put (AcalephStorage#193)
  Delete a reminder if the corresponding healthcheck is blacklisted (AcalephStorage#201)
  Trim config parameters (AcalephStorage#203)
  Giving clearer examples on slack setup
  Add tests for SlackNotifier unmarshalling from json
  Enable cluster name override on slack integration
  Update README.md (AcalephStorage#196)
  ACL example (AcalephStorage#194)
  Add cluster-name to awssns notifier
  Add templating to awssns notifier
  Add test for aws-sns-notifier
  Extract templating so it can be used from multiple notifiers
  bump version (AcalephStorage#189)
  Travis runs tests now (AcalephStorage#188)
  README entry for the specific tresholds
  Tests for consul.GetChangeThreshold
  ...

# Conflicts:
#	Dockerfile
#	notifier/opsgenie-notifier.go