Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setting redis as coordination backend #169

Merged
merged 6 commits into from
Jan 19, 2021

Conversation

arms11
Copy link
Contributor

@arms11 arms11 commented Jan 12, 2021

Closes #94
Fixes #167

For the time-being...

  1. The release name is hard-wired in the connection string. This means, please use st2ha as the release name.
  2. Password is not enabled.
  3. sentinel=mymaster in the connection string has been set assuming the name of the master is the name of the masterset.

Below is the error I have been receiving...

2021-01-12 03:13:00,478 ERROR [-] (PID=1) ST2 API quit due to exception.
Traceback (most recent call last):
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/tooz/drivers/redis.py", line 47, in _translate_failures
    yield
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/tooz/drivers/redis.py", line 457, in _start
    self._server_info = self._client.info()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/redis/client.py", line 1304, in info
    return self.execute_command('INFO')
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/redis/sentinel.py", line 44, in connect
    self.connect_to(self.connection_pool.get_master_address())
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/redis/sentinel.py", line 107, in get_master_address
    self.service_name)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/redis/sentinel.py", line 219, in discover_master
    raise MasterNotFoundError("No master found for %r" % (service_name,))
redis.sentinel.MasterNotFoundError: No master found for 'mymaster'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2api/cmd/api.py", line 86, in main
    return _run_server()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2api/cmd/api.py", line 75, in _run_server
    wsgi.server(sock, app.setup_app(), custom_pool=worker_pool, log=LOG, log_output=False)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2api/app.py", line 77, in setup_app
    router.add_spec(spec, transforms=transforms)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/router.py", line 213, in add_spec
    __import__(module_name)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2api/controllers/v1/keyvalue.py", line 437, in <module>
    key_value_pair_controller = KeyValuePairController()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2api/controllers/v1/keyvalue.py", line 62, in __init__
    self._coordinator = coordination.get_coordinator()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/services/coordination.py", line 227, in get_coordinator
    COORDINATOR = coordinator_setup(start_heart=start_heart)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/services/coordination.py", line 200, in coordinator_setup
    coordinator.start(start_heart=start_heart)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/tooz/coordination.py", line 690, in start
    super(CoordinationDriverWithExecutor, self).start(start_heart)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/tooz/coordination.py", line 426, in start
    self._start()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/tooz/drivers/redis.py", line 457, in _start
    self._server_info = self._client.info()
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures
    cause=e)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/tooz/utils.py", line 225, in raise_with_cause
    excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
    six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
  File "<string>", line 3, in raise_from
tooz.coordination.ToozConnectionError: No master found for 'mymaster'


@pull-request-size pull-request-size bot added the size/M PR that changes 30-99 lines. Good size to review. label Jan 12, 2021
@arms11 arms11 marked this pull request as draft January 12, 2021 03:26
@arm4b
Copy link
Member

arm4b commented Jan 12, 2021

Thanks a lot @arms11! 👍

Will try to run this setup and play with the config options.

Comment on lines +61 to +76
@test 'st2 key/value operations are functional' {
run st2 key set foo bar
assert_success

run st2 key get foo
assert_success
assert_line --partial 'bar'

run st2 key delete foo
assert_line --partial '"foo" has been successfully deleted'
assert_success

run st2 key get foo
assert_line --partial '"foo" is not found'
assert_failure
}
Copy link
Member

@arm4b arm4b Jan 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on #167, included some tests here to verify the st2 K/V functionality.

Copy link
Member

@arm4b arm4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @arms11!

I just tested this config and it works OK on a local minikube 👍
Same for CircleCI e2e tests in Github, check the status:
✔️ https://app.circleci.com/pipelines/github/StackStorm/stackstorm-ha/906/workflows/c98fb252-0ac4-47f8-85b3-71a8f29ddd7f

Having said that, everything works well, it's great start!
Let's continue on connection string templating and removing the etcd dependency so we can merge this fix.

@arm4b arm4b added the feature label Jan 16, 2021
@arms11
Copy link
Contributor Author

arms11 commented Jan 17, 2021

Thanks @armab . That's great and concerning at the same time. Not sure how this is working for you and not for me! I do have minikube with the 1.20 K8s which is the latest. I will verify this in my separate EKS cluster which is 1.18 to be sure. If I do see difference, while I will move ahead with what you advised, we can raise an issue for bitnami maintainers of the chart and see if they have observed the same.

@arms11
Copy link
Contributor Author

arms11 commented Jan 18, 2021

@armab - I have pushed the changes for Redis.
Tested the following:

  1. Ensure the reported use case passes.
  2. Redis cluster forms with 3 replicas and Sentinel Enabled. (passes on k8s 1.18)
  3. Sentinel enabled set to false fails the validation. So, if redis is enabled, it must be used with Sentinel enabled as per the tooz lib documentation.
  4. Removed references of etcd from documentation.

templates/_helpers.tpl Outdated Show resolved Hide resolved
@arm4b
Copy link
Member

arm4b commented Jan 18, 2021

Besides of the comment above, we'll also need a changelog record for this PR and I think it will be good to merge! 👍

@arms11 arms11 marked this pull request as ready for review January 19, 2021 00:41
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Member

@arm4b arm4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work here 👍

Thanks a lot for the contribution!

@arm4b arm4b merged commit eee2803 into StackStorm:master Jan 19, 2021
@arm4b
Copy link
Member

arm4b commented Jan 19, 2021

@arms11 Following this major change, could you also update the official st2 documentation page for the stackstorm-ha here https://docs.stackstorm.com/install/k8s_ha.html ?

The documentation page source code is available at https://github.com/StackStorm/st2docs/blob/master/docs/source/install/k8s_ha.rst

@ytjohn
Copy link
Contributor

ytjohn commented Jan 19, 2021

@armab @arms11

This could not have come at a better time. A PKS/Kube host upgrade this weekend killed our etcd-cluster. After fighting half the day to get our obsolete stuff working again, I saw this merge in. Ran it a bit ago and it seems to be working perfect.

One thing for documentation/awareness. We upgraded from chart 0.32.0 to 0.51.0. So we still had etcd-operator. I had commented out the etcd-operator section for the first run. Everything came up, but just as the final st2-mongodb-2 was starting up, etcd-operator appeared and st2-redis instances vanished. Once etcd-operator was in play, the mongo and rabbitmq pods started restarting and everything went downhill.

For my second run, I explicitly set etcd-operator.enabled: false. The etcd-cluster & etc-operator pods went away, the new redis pods got created, and everything came up.

It's possible that this was some sort of hidden failure and helm automatically rolling back to the prior chart, but it wasn't clear. Since I have leftover bits related to etcd, and anyone else upgrading will, I recommend that they explicitly disable etcd-operator for a release or two. Maybe disable it before doing the upgrade to 0.51.0.

  redis:
    enabled: true
    cluster:
      slaveCount: 3
    sentinel:
      enabled: true
    networkPolicy:
      enabled: false
    usePassword: false
    metrics:
      enabled: false
  etcd-operator:
    enabled: false

@arms11 arms11 deleted the redis-as-coordination branch January 20, 2021 01:14
@arms11
Copy link
Contributor Author

arms11 commented Jan 20, 2021

@ytjohn - Point noted. Actually, this makes sense as you upgraded. Actually, this change should have been in the same PR where we moved from stable to bitnami as we had direct etcd cluster and not etcd-operator based resource. Thanks for bringing this up and I am glad this worked for you. Hopefully, we will also upgrade soon.
@armab - I must admit this test was not performed, and I believe we may have to consider adding this. Not sure how long we may have to keep it. Anyone upgrading from older version beyond the bitnami change could land in this behavior.
I will run a few tests, but I would like to get your opinion in order to do this best possible way.

@ytjohn
Copy link
Contributor

ytjohn commented Jan 20, 2021

@arms11 I just got done reviewing. It turns out that what happened was a helm timeout. Our organization has been running into docker hub rate limits and the deploy took over 10 minutes. After the timeout, helm attempts to roll back. That is what I was seeing. It had nothing to do with the etcd-operator key like I originally thought. In fact, I don't see anything in the helm chart that would act upon such a value anyways.

Everything else worked great.

@arms11
Copy link
Contributor Author

arms11 commented Jan 20, 2021

Thanks @ytjohn. Much appreciated the insight! That helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature size/M PR that changes 30-99 lines. Good size to review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't set datastore key/value pairs in 0.50.0 'stable/etcd-operator' is not really stable for [coordination]
3 participants