mgr/crash: raise warning about recent crashes and other improvements #29034

Merged
merged 22 commits into from Jul 24, 2019

Conversation

liewegas
Member

  • raise health warning about recent crashes
  • 'ls-new' as well as 'ls' command
  • 'archive' and 'archive-all' commands
  • keep crashes in mgr memory
  • automatic pruning
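
To make the relationship between 'ls', 'ls-new', and the archive commands concrete, here is a minimal sketch of an in-memory crash index. The class name `CrashStore`, its method names, and the `archived` metadata key are hypothetical and chosen for illustration only; this is not the code in src/pybind/mgr/crash/module.py. It assumes that archiving marks a crash as acknowledged so that it no longer counts as new.

```python
class CrashStore:
    """Hypothetical in-memory crash index (illustration only)."""

    def __init__(self):
        # crash_id -> metadata dict; assumed layout, not the module's real one
        self.crashes = {}

    def post(self, crash_id, meta):
        # Store a copy so callers cannot mutate our state by accident.
        self.crashes[crash_id] = dict(meta)

    def ls(self):
        # Every known crash, archived or not.
        return sorted(self.crashes)

    def ls_new(self):
        # Only crashes that have not been archived (acknowledged) yet.
        return sorted(
            crash_id
            for crash_id, meta in self.crashes.items()
            if "archived" not in meta
        )

    def archive(self, crash_id, when):
        # Keep the first archive time if the crash was already archived.
        self.crashes[crash_id].setdefault("archived", when)

    def archive_all(self, when):
        for meta in self.crashes.values():
            meta.setdefault("archived", when)
```

Under these assumptions, archiving a crash removes it from the 'ls-new' view while leaving it visible in 'ls'.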

@liewegas
Member Author

@alfredodeza I had some battles with Python 3 here, mainly with all of the comprehensions over dicts. IIUC, I have to write "for key, value in mydict.items()", whereas in Python 2 the .items() wasn't needed. Is that right?

@dmick
Member

dmick commented Jul 15, 2019

I believe in Python 2 "for x in d" would give you the keys, so the same as "for x in d.keys()". You always needed items() or iteritems() to get (k, v) pairs. In Python 3, items() is lazy the way iteritems() was in Python 2, and iteritems() itself is gone, so it is now just items().

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas liewegas force-pushed the wip-crash-health branch 2 times, most recently from e7773d4 to cbedfbc Compare July 16, 2019 14:26
@alfredodeza
Contributor

This is compatible with both Python versions (which Dan explains):

for key, value in dictionary.items()
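
As a small illustration of the point made in the two comments above, the following sketch shows what each iteration form yields. The dictionary contents and variable names are made up for the example. The first two forms behave the same in Python 2 and Python 3; the view behaviour at the end is specific to Python 3.

```python
# Example data only; the keys and counts are made up.
crash_counts = {"osd.0": 2, "mon.a": 1}

# Iterating a dict directly yields its keys, in Python 2 and Python 3 alike.
keys = [name for name in crash_counts]

# To unpack (key, value) pairs, .items() is required in both versions.
doubled = {name: count * 2 for name, count in crash_counts.items()}

# Python 3 only: items() returns a live view rather than a list copy,
# so it reflects entries added to the dict afterwards.
items_view = crash_counts.items()
crash_counts["mgr.x"] = 3
```

Writing "for key, value in mydict" without .items() fails in either version, because each key would be unpacked as if it were a pair.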

@liewegas liewegas force-pushed the wip-crash-health branch 6 times, most recently from b28067b to 0619d26 Compare July 18, 2019 14:38
@liewegas
Member Author

@noahdesu I removed the telemetry test since the crash module now does a better validity check on the crash dump. Is that OK?

@dotnwat
Contributor

dotnwat commented Jul 22, 2019 via email

@liewegas
Member Author


The time period for what "recent" means is controlled by the option
``mgr/crash/warn_recent_interval`` (default: two weeks).
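
A rough sketch of how such an interval could be applied follows. It assumes each crash record carries a `timestamp` as a `datetime` and that archived crashes carry an `archived` key; the function name, the constant, and the record layout are hypothetical and are not taken from the crash module.

```python
from datetime import datetime, timedelta

# Default of two weeks, as stated in the text above.
WARN_RECENT_INTERVAL = timedelta(weeks=2)


def recent_unarchived(crashes, now, interval=WARN_RECENT_INTERVAL):
    """Return ids of crashes newer than the cutoff that are not archived.

    `crashes` maps crash id -> metadata dict (assumed layout).
    """
    cutoff = now - interval
    return sorted(
        crash_id
        for crash_id, meta in crashes.items()
        if meta["timestamp"] >= cutoff and "archived" not in meta
    )


now = datetime(2019, 7, 24)
crashes = {
    "new": {"timestamp": datetime(2019, 7, 20)},
    "old": {"timestamp": datetime(2019, 6, 1)},
    "acked": {"timestamp": datetime(2019, 7, 22), "archived": now},
}
recent = recent_unarchived(crashes, now)
```

With this sample data only the crash with id "new" would count toward a warning: "old" falls outside the two-week window and "acked" has been archived.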

Member


Should we also add retain_interval here?

@liewegas
Member Author

liewegas commented Jul 24, 2019 via email

@neha-ojha
Member

I pushed another commit that adds it to the mgr/crash.rst file. I don't think it belongs in the health report section since it's unrelated to the report (or mitigating it).

looks good

@liewegas liewegas merged commit 1ea7570 into ceph:master Jul 24, 2019
liewegas added a commit that referenced this pull request Jul 24, 2019
* refs/pull/29034/head:
	doc/mgr/crash: document missing commands, options
	qa/suites/rados/singleton/all/test-crash: whitelist RECENT_CRASH
	qa/suites/rados/mgr/tasks/insights: whitelist RECENT_CRASH
	qa/tasks/mgr/test_insights: crash module now rejects bad crash reports
	mgr/telemetry: fix remote into crash do_ls()
	mgr/crash: don't make these methods static
	mgr/BaseMgrModule: handle unicode health detail strings
	mgr/crash: verify timestamp is valid
	qa/suites/mgr: whitelist RECENT_CRASH
	mgr/crash: remove unused var
	mgr/crash: remove unused import 'six'
	qa/workunits/rados/test_crash: health check
	mgr/crash: improve validation on post
	mgr/crash: automatically prune old crashes after a year
	mgr/crash: raise RECENT_CRASH warning for recent (new) crashes
	mgr/crash: add 'crash ls-new'
	mgr/crash: add option and serve infra
	mgr/crash: keep copy of crashes in memory
	mgr/pg_autoscaler: adjust style to match built-in tables
	mgr/crash: make 'crash ls' a nice table with a NEW column
	mgr/crash: nicely format 'crash info' output
	mgr/crash: add 'crash archive <id>', 'crash archive-all' commands

Reviewed-by: Neha Ojha <nojha@redhat.com>
@liewegas liewegas deleted the wip-crash-health branch July 24, 2019 23:00