
stage.0: Module function cephprocesses.wait threw an exception. Exception: 'openattic' #657

Closed
Martin-Weiss opened this issue Sep 20, 2017 · 14 comments

@Martin-Weiss

On an already deployed cluster I get this error when executing stage.0:

  Name: wait until the cluster has recovered before processing ses-5-single.emea.utopia.novell.com - Function: salt.state - Result: Changed Started: - 09:08:20.869260 Duration: 6408.758 ms
----------
          ID: check if all processes are still running after processing ses-5-single.emea.utopia.novell.com
    Function: salt.state
      Result: False
     Comment: Run failed on minions: ses-5-single.emea.utopia.novell.com
              Failures:
                  ses-5-single.emea.utopia.novell.com:
                  ----------
                            ID: wait processes
                      Function: module.run
                          Name: cephprocesses.wait
                        Result: False
                       Comment: Module function cephprocesses.wait threw an exception. Exception: 'openattic'
                       Started: 09:08:27.525172
                      Duration: 83.194 ms
                       Changes:

                  Summary for ses-5-single.emea.utopia.novell.com
                  ------------
                  Succeeded: 0
                  Failed:    1
                  ------------
                  Total states run:     1
                  Total run time:  83.194 ms
     Started: 09:08:27.278118
    Duration: 350.72 ms
     Changes:

Summary for ses-5-single.emea.utopia.novell.com_master
-------------
Succeeded: 14 (changed=8)
Failed:     1
-------------

The openattic.service is up and running, and I can access openATTIC with a web browser without any problems.

Any idea what might be wrong here?

@swiftgist (Contributor)

Did your cluster go into HEALTH_ERR? The steps in Stage 0 for a minion are serialized for an already running cluster; see /srv/salt/ceph/stage/0/minion/default.sls. The ceph.wait state is simply paranoia on our part: if the previous update on some minion caused an issue and the cluster did not recover, we bail out.

Unfortunately, the HEALTH_ERR status isn't terribly granular, so we have no systematic way of correlating cause and effect.
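
Conceptually the wait amounts to polling the cluster health and bailing out if it stays in error; a rough Python sketch, where the command invocation, timeout, and JSON field are illustrative rather than the actual DeepSea code:

    import json
    import subprocess
    import time

    def wait_until_recovered(timeout=120, interval=6):
        """Poll 'ceph health' and give up if the cluster stays in HEALTH_ERR."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(['ceph', 'health', '--format=json'])
            if json.loads(out.decode())['status'] != 'HEALTH_ERR':
                return True   # cluster recovered (or was never in error)
            time.sleep(interval)
        return False          # still HEALTH_ERR -> bail out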

@Martin-Weiss (Author)

Did your cluster go into HEALTH_ERR?

No - the cluster and all services are up and running, and I am test-running stages 0-5, where I do not expect failures since I have not changed anything.

Any idea how to get more debug information here?

@jschmid1 (Contributor) commented Sep 20, 2017

@Martin-Weiss

There is a module called cephprocesses.py which checks that the expected services are up.

You can try it either on the respective node with:

salt-call cephprocesses.check

or target the node directly from the master:

salt '$thenode' cephprocesses.check

or use the condensed runner that checks all services for all roles on all nodes:

salt-run cephprocesses.check

Appending -l debug might give us more insight.
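
For example, on the failing minion (this is just Salt's standard log-level flag; it works the same for salt and salt-run):

    salt-call -l debug cephprocesses.check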

@Martin-Weiss (Author)

salt-call cephprocesses.check

Result:

ses-5-single:~ # salt-call cephprocesses.check
[ERROR   ] An un-handled exception was caught by salt's global exception handler:
KeyError: 'openattic'
Traceback (most recent call last):
  File "/usr/bin/salt-call", line 11, in <module>
    salt_call()
  File "/usr/lib/python2.7/site-packages/salt/scripts.py", line 379, in salt_call
    client.run()
  File "/usr/lib/python2.7/site-packages/salt/cli/call.py", line 58, in run
    caller.run()
  File "/usr/lib/python2.7/site-packages/salt/cli/caller.py", line 134, in run
    ret = self.call()
  File "/usr/lib/python2.7/site-packages/salt/cli/caller.py", line 197, in call
    ret['return'] = func(*args, **kwargs)
  File "/var/cache/salt/minion/extmods/modules/cephprocesses.py", line 36, in check
    for process in processes[role]:
KeyError: 'openattic'
Traceback (most recent call last):
  File "/usr/bin/salt-call", line 11, in <module>
    salt_call()
  File "/usr/lib/python2.7/site-packages/salt/scripts.py", line 379, in salt_call
    client.run()
  File "/usr/lib/python2.7/site-packages/salt/cli/call.py", line 58, in run
    caller.run()
  File "/usr/lib/python2.7/site-packages/salt/cli/caller.py", line 134, in run
    ret = self.call()
  File "/usr/lib/python2.7/site-packages/salt/cli/caller.py", line 197, in call
    ret['return'] = func(*args, **kwargs)
  File "/var/cache/salt/minion/extmods/modules/cephprocesses.py", line 36, in check
    for process in processes[role]:
KeyError: 'openattic'

That was on the node itself; the following is from the admin node:


ses-5-single:~ # salt 'ses-5-single*' cephprocesses.check
ses-5-single.emea.utopia.novell.com:
    The minion function caused an exception: Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/salt/minion.py", line 1445, in _thread_return
        return_data = executor.execute()
      File "/usr/lib/python2.7/site-packages/salt/executors/direct_call.py", line 28, in execute
        return self.func(*self.args, **self.kwargs)
      File "/var/cache/salt/minion/extmods/modules/cephprocesses.py", line 36, in check
        for process in processes[role]:
    KeyError: 'openattic'
ses-5-single:~ # salt-run cephprocesses.check
True

Output of salt-call cephprocesses.check attached:

cephprocess-debug.txt

@jschmid1 (Contributor)

for process in processes[role]:
KeyError: 'openattic'

This confirms that #661 will fix your issue.
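
For anyone else hitting this, here is a minimal Python sketch of the failure mode; the map contents are invented for illustration and this is not the actual cephprocesses.py source:

    # cephprocesses.check iterates the roles assigned to the minion and
    # indexes a role -> expected-processes map; any role missing from
    # that map (here 'openattic') raises KeyError.
    processes = {
        'mon': ['ceph-mon'],
        'mgr': ['ceph-mgr'],
        'storage': ['ceph-osd'],
    }

    def check(roles):
        for role in roles:
            for process in processes[role]:  # raises KeyError: 'openattic'
                pass  # ... verify that the process is running ...

    try:
        check(['mon', 'mgr', 'openattic'])
    except KeyError as err:
        print('KeyError: {}'.format(err))    # KeyError: 'openattic'

Adding the missing role to the map (or guarding the lookup with processes.get(role, [])) makes the check pass, which is presumably what #661 does for 'openattic'.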

@Martin-Weiss (Author)

I manually applied the change from #661 and executed salt "*" saltutil.sync_all.

--> After this the error is gone! THANKS!

@khodayard

I'm having the same issue with prometheus:

salt cl5.opn.shft cephprocesses.check

cl5.opn.shft:
    The minion function caused an exception: Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/salt/minion.py", line 1445, in _thread_return
        return_data = executor.execute()
      File "/usr/lib/python2.7/site-packages/salt/executors/direct_call.py", line 28, in execute
        return self.func(*self.args, **self.kwargs)
      File "/var/cache/salt/minion/extmods/modules/cephprocesses.py", line 89, in check
        if pdict_exe in processes[role] or pdict_name in processes[role]:
    KeyError: 'prometheus'

@jschmid1 (Contributor) commented Apr 3, 2019

Which DeepSea version are you using, @khodayard?

@khodayard

Versions Report
cl5:~ # salt-run deepsea.version
0.8.9+git.0.c638bee79
cl5:~ # rpm -qi salt-minion
Name : salt-minion
Version : 2016.11.4
Release : 8.1
Architecture: x86_64
Install Date: Mon Mar 25 12:31:18 2019
Group : System/Management
Size : 37807
License : Apache-2.0
Signature : RSA/SHA256, Mon Aug 7 15:31:24 2017, Key ID b88b2fd43dbdc284
Source RPM : salt-2016.11.4-8.1.src.rpm
Build Date : Mon Aug 7 15:30:15 2017
Build Host : cloud125
Relocations : (not relocatable)
Packager : http://bugs.opensuse.org
Vendor : openSUSE
URL : http://saltstack.org/
Summary : The client component for Saltstack
Description :
Salt minion is queried and controlled from the master.
Listens to the salt master and execute the commands.
Distribution: openSUSE Leap 42.3
cl5:~ # rpm -qi salt-master
Name : salt-master
Version : 2016.11.4
Release : 8.1
Architecture: x86_64
Install Date: Mon Mar 25 12:31:18 2019
Group : System/Management
Size : 1662854
License : Apache-2.0
Signature : RSA/SHA256, Mon Aug 7 15:31:24 2017, Key ID b88b2fd43dbdc284
Source RPM : salt-2016.11.4-8.1.src.rpm
Build Date : Mon Aug 7 15:30:15 2017
Build Host : cloud125
Relocations : (not relocatable)
Packager : http://bugs.opensuse.org
Vendor : openSUSE
URL : http://saltstack.org/
Summary : The management component of Saltstack with zmq protocol supported
Description :
The Salt master is the central server to which all minions connect.
Enabled commands to remote systems to be called in parallel rather
than serially.
Distribution: openSUSE Leap 42.3
cl5:~ #

@jschmid1 (Contributor) commented Apr 4, 2019

@khodayard There is no role-grafana or role-prometheus in 0.8.x yet. If you just remove that entry from the policy.cfg, DeepSea will deploy your monitoring stack on the master.
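
That is, the lines to drop look like these (patterns are illustrative; match whatever your policy.cfg actually contains):

    role-prometheus/cluster/*.sls
    role-grafana/cluster/*.sls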

@khodayard

@jschmid1 Thank you for your response. This is my policy.cfg now:

:~ # cat /srv/pillar/ceph/proposals/policy.cfg
role-master/cluster/cl5.opn.shft.sls
role-admin/cluster/*.sls
cluster-ceph/cluster/*.sls
role-mon/cluster/*.sls
role-mgr/cluster/*.sls
role-mds/cluster/*.sls
role-igw/cluster/*.sls
role-rgw/cluster/*.sls
role-ganesha/cluster/*.sls
role-openattic/cluster/*.sls
config/stack/default/global.yml
config/stack/default/ceph/cluster.yml
profile-default/cluster/*.sls
profile-default/stack/default/ceph/minions/*.yml

But I'm getting the same result:

Ended stage: ceph.stage.0 succeeded=14/42 failed=2/42 time=76.4s

Failures summary:

ceph.metapackage (/srv/salt/ceph/metapackage):
  cl5.opn.shft:
ceph.processes (/srv/salt/ceph/processes):
  cl5.opn.shft:
    wait for all processes: Module function cephprocesses.wait threw an exception. Exception: 'prometheus'

I've even tried to upgrade DeepSea to the latest version from GitHub, but it failed and I had to revert to a snapshot:

:~ # deepsea stage run ceph.stage.0
Traceback (most recent call last):
  File "/usr/bin/deepsea", line 9, in <module>
    load_entry_point('deepsea==0.9.16+24.g715e0713', 'console_scripts', 'deepsea')()
  File "/usr/lib/python3.4/site-packages/pkg_resources/__init__.py", line 558, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python3.4/site-packages/pkg_resources/__init__.py", line 2682, in load_entry_point
    return ep.load()
  File "/usr/lib/python3.4/site-packages/pkg_resources/__init__.py", line 2355, in load
    return self.resolve()
  File "/usr/lib/python3.4/site-packages/pkg_resources/__init__.py", line 2361, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/lib/python3.4/site-packages/deepsea/main.py", line 10, in <module>
    from .deepsea import main
  File "/usr/lib/python3.4/site-packages/deepsea/deepsea.py", line 22, in <module>
    from .monitor import Monitor
  File "/usr/lib/python3.4/site-packages/deepsea/monitor.py", line 17, in <module>
    from .salt_event import SaltEventProcessor
  File "/usr/lib/python3.4/site-packages/deepsea/salt_event.py", line 11, in <module>
    import salt.config
ImportError: No module named 'salt'

Thanks again.

@khodayard

@jschmid1 Would you please take a look at #1599? That's my main problem, which I'm trying to fix with this workaround. :)

@jschmid1 (Contributor) commented Apr 5, 2019

Make sure to run stage.2 after changing the policy.cfg.
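
For example, either invocation should work (the second form assumes the deepsea CLI is installed, as used earlier in this thread):

    salt-run state.orch ceph.stage.2
    # or
    deepsea stage run ceph.stage.2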

@khodayard

Running stage.2 fixed that problem, thank you.
