ceph-disk: enable --runtime ceph-osd systemd units #12241
Conversation
systemd/ceph-osd@.service
@@ -19,6 +19,7 @@ TasksMax=infinity
 Restart=on-failure
 StartLimitInterval=30min
 StartLimitBurst=3
+RestartSec=5min
Maybe a 5-minute delay is too long in a production environment?
@idealguo maybe it is. Do you have a specific scenario in mind?
I see the default value is 100ms (https://www.freedesktop.org/software/systemd/man/systemd.service.html)
and "osd_heartbeat_grace" is 20s by default, so maybe 20s or thereabouts would be an option.
In the scenario I have in mind, at boot time, ceph-disk@.service may run 5 to 10 minutes after ceph-osd@.service attempted to run. There is no doubt that waiting minutes is useful. What I'm not 100% sure about yet is whether this can cause problems.
My concern is when the OSD service exits accidentally, for example because it is killed by someone. If we wait 5 minutes before restarting, that will be too long.
you're right. What about
RestartSec=20s
StartLimitBurst=30
That will account for both cases. At boot time it will retry every 20s for about 10 minutes. And at runtime it will restart within 20 seconds, with no risk of the OSD being marked down. I don't think anything at runtime will behave differently if the OSD restarts after 20 seconds instead of after 100ms. What do you think?
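For illustration, the relevant [Service] settings with these values would read roughly as follows (a sketch; StartLimitInterval is the 30min already present in the unit file, and systemd unit files only allow comments on their own lines):

[Service]
Restart=on-failure
# pause 20s between restart attempts instead of the 100ms default
RestartSec=20s
# window in which failed start attempts are counted
StartLimitInterval=30min
# 30 attempts x 20s = about 10 minutes of retries at boot
StartLimitBurst=30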
Yes, 100ms is too sensitive, it will hit the "StartLimitBurst" limit quickly.
@dachary Would "ceph-disk: do not enable ceph-osd systemd units" be a better title for this PR (and for the second commit)?
@smithfarm indeed, thanks :-)
Instead of the default 100ms pause before trying to restart an OSD, wait 20 seconds instead and retry 30 times instead of 3. There is no scenario in which restarting an OSD almost immediately after it failed would get a better result.

It is possible that a failure to start is due to a race with another systemd unit at boot time. For instance, if ceph-disk@.service is delayed, it may start after the OSD that needs it. A long pause may give the racing service enough time to complete, and the next attempt to start the OSD may succeed.

This is not a sound alternative to resolve a race, it only makes the OSD boot process less sensitive. In the example above, the proper fix is to enable --runtime ceph-osd@.service so that it cannot race at boot time.

The wait delay should not be minutes, to preserve the current runtime behavior. For instance, if an OSD is killed or fails and restarts after 10 minutes, it will be marked down by the ceph cluster. This is not a change that could break things, but it is significant and should be avoided.

Refs: http://tracker.ceph.com/issues/17889

Signed-off-by: Loic Dachary <loic@dachary.org>
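As a quick sanity check after the unit file change, the effective values can be read back with systemctl show (a sketch; osd id 3 and the exact output format are illustrative):

systemctl show ceph-osd@3.service --property=RestartUSec --property=StartLimitBurst
# expected output, roughly:
#   RestartUSec=20s
#   StartLimitBurst=30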
@liewegas I'll figure something out to fix existing installations on upgrade.
If ceph-osd@.service is enabled for a given device (say /dev/sdb1 for osd.3), the ceph-osd@3.service will race with ceph-disk@dev-sdb1.service at boot time. Enabling ceph-osd@3.service is not necessary at boot time because ceph-disk@dev-sdb1.service calls

    ceph-disk activate /dev/sdb1

which calls

    systemctl start ceph-osd@3

The systemctl enable/disable ceph-osd@.service called by ceph-disk activate is changed to add the --runtime option so that ceph-osd units are lost after a reboot. They are recreated when ceph-disk activate is called at boot time, so that

    systemctl stop ceph

knows which ceph-osd@.service to stop when a script or sysadmin wants to stop all ceph services.

Before enabling ceph-osd@.service (which happens at every boot time), make sure the permanent enablement in /etc/systemd is removed, so that only the one added by systemctl enable --runtime in /run/systemd remains. This is useful to upgrade an existing cluster without creating a situation that is even worse than before, because ceph-disk@.service would race against two ceph-osd@.service (one in /etc/systemd and one in /run/systemd).

Fixes: http://tracker.ceph.com/issues/17889

Signed-off-by: Loic Dachary <loic@dachary.org>
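In shell terms, the sequence described above amounts to something like the following (a sketch, assuming osd.3; the real calls are made internally by ceph-disk activate):

# drop any permanent enablement left by a previous version
# (removes the symlink under /etc/systemd/system)
systemctl disable ceph-osd@3

# enable for the current boot only: the symlink lives under
# /run/systemd/system, disappears at reboot, and is recreated
# by ceph-disk activate at the next boot
systemctl enable --runtime ceph-osd@3

systemctl start ceph-osd@3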
[
    'systemctl',
    'disable',
    'ceph-osd@{osd_id}'.format(osd_id=osd_id),
any reason why the trailing ".service" was omitted? It is the default, but we don't have 100% certainty that it will always be this way, and IMO it's better to specify the full form of the unit name instead of relying on systemd to fill in the blank.
unless you think that's important and likely to happen, I'd rather have that in a separate cleanup commit to keep this one minimal. I'm under the impression that systemd will keep that naming convention to avoid backward-compatibility problems.
of course, it's not likely to happen and that kind of change falls into the "cleanup" category.
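For the record, the cleanup discussed here would simply spell out the suffix in the argument list quoted above (a sketch, not an actual commit from this PR):

[
    'systemctl',
    'disable',
    'ceph-osd@{osd_id}.service'.format(osd_id=osd_id),
    # ...rest of the invocation unchanged
]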
jenkins test this please (test_objectstore_memstore.sh)
teuthology-suite -k distro --verbose --suite ceph-disk --suite-branch master --ceph wip-17889-systemd-order --machine-type vps --priority 101 machine_types/vps.yaml ~/shaman.yaml
ceph-disk is responsible for enabling the unit file if needed. Actually, since ceph/ceph#12241 it seems that it's not even needed. In the event of a restart, udev rules will be triggered and they will run ceph-disk activate on the device too, so the 'enabled' state is not needed.

Closes: #1142

Signed-off-by: Sébastien Han <seb@redhat.com>