Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement OOM score adjustment #435

Closed
wants to merge 1 commit into from

Conversation

whitslack
Copy link
Contributor

@whitslack whitslack commented Aug 10, 2021

This PR adds a new --oom-score-adj option to start-stop-daemon and supervise-daemon to complement the existing --nicelevel and --ionice options. Often it is advantageous to depress the OOM score of critical system services so that the kernel's OOM killer will prefer to kill more "disposable" processes in OOM scenarios. SSD and SD previously lacked a clean way to adjust the OOM scores of the daemons they start. With this patchset, now the oom_score_adj of launched daemons can be specified either by the new command-line option or by a new SSD_OOMSCOREADJ environment variable, which is akin to SSD_NICELEVEL and SSD_IONICELEVEL.

Note: While crafting this enhancement, I noticed that supervise-daemon was missing support for the SSD_IONICELEVEL environment variable, as though it had been overlooked when such support was added to start-stop-daemon, so I took the opportunity to fill in the missing implementation (in a separate commit, which precedes the commit that holds the real meat of this PR).

@vapier
Copy link
Member

vapier commented Aug 10, 2021

should send the ionice change as a sep PR so we can get that in independently (i know it's a sep commit, but GH doesn't make it easy to merge PR's partially)

imo we should avoid short options unless there's a commonish meaning behind it. for example, -h is very often short for --help, so that's OK. but i don't know that anyone uses -O for --oom-score-adj. do you have references to any prior art here ? a ton of short inscrutable short options makes for command lines & init scripts that need a decoder ring to understand. saving ~10 bytes now wastes everyone's time later. i can guarantee no one will ever remember the meanings of every single ssd short option.

@whitslack
Copy link
Contributor Author

Fine by me. I hate short options, especially in scripts where their meaning is never clear when I go back to look at the script I wrote even a week later. I only defined a short option because every other option had a short form. If you're happy to break that pattern, I'm happy to remove -O.

As for splitting into two PRs, how should I do that? The PR to add the --oom-score-adj option will be dependent on the PR to fill in the SSD_IONICELEVEL environment variable support. Should the second PR target the branch of the first PR rather than targeting master?

Regarding merging, you're aware that you can merge PRs using the command line, right? GitHub will pick up the merge and reflect it properly on the web. I think if you merge only the first commit of this PR (which you could/should do as a FF, in my opinion), then GitHub will leave this PR open.

@williamh
Copy link
Contributor

Hi @whitslack, Yes, both prs should target master and one will have to be rebased after the other is merged.

I do know that you can merge on the command line, but you get the full pr still when you do that.

I agree also with @vapier that we should move away from short options. If I have my way I would deprecate short options where possible.
I think that is going to take rewriting the command line parser a bit since we use getopt_long() which seems to require both long and short options unless I'm missing something.

@whitslack
Copy link
Contributor Author

I do know that you can merge on the command line, but you get the full pr still when you do that.

@williamh: I'm confused. You don't have to merge from the tip of the PR branch.

$ git fetch origin
$ git checkout origin/master
$ git fetch origin pull/435/head
$ git merge --no-ff FETCH_HEAD^{/SSD_IONICELEVEL}
$ git push origin -- @:master

What are you saying GitHub does in that case? Admittedly I've never done it, but my expectation is that the PR would be left open and would remain mergeable with no conflicts.


I think that is going to take rewriting the command line parser a bit since we use getopt_long() which seems to require both long and short options unless I'm missing something.

No, that's not necessary. You just leave the short option out of the optstring and choose some other value for the long option's val. The usage printer in _usage.c omits printing a short option if val is not a printable character.

@vapier
Copy link
Member

vapier commented Aug 10, 2021

i'm not anti all short options, just using short options that are not common and/or obvious.

getopt_long() does not require short options. the val is simply what getopt_long returns so you can switch on it. the only short options are what is passed in optstring. for long-only options, we can simply use values greater than 127 to avoid conflict with the ASCII space. personally i've started using 0x1000 as the long-only base so it's less confusing to readers.

@whitslack
Copy link
Contributor Author

whitslack commented Aug 10, 2021

@vapier: I used val=1 for the long-only --oom-score-adj option. Would you prefer 0x1000?

Note that 0x1000 is going to make your switch table much sparser, which may not be as good for code size.

@vapier
Copy link
Member

vapier commented Aug 10, 2021

def do not use values <127. it's confusing to read, will be even more confusing once we exhaust the limited non-printable ASCII space, and can be easy to introduce collisions if you're not careful (since they're all ints, 32 and ' ' will be the same, etc...).

i would actually look at codegen in -O2/-Os levels before deciding between 0x1000 & 0x80 base. i suspect it won't be that big of a deal, but i don't care enough at that point to bikeshed :p.

etc/rc.conf Outdated
@@ -120,6 +120,8 @@
# Or the ionice level. The format is class[:data] , just like the
# --ionice start-stop-daemon parameter.
#SSD_IONICELEVEL="2:2"
# Or the OOM score adjustment.
#SSD_OOMSCOREADJ="-1000"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is -1000 a good default to leave here ? it basically means "never allow OOM to kill this". if you want to keep that, it'd be worth a comment, and a ref to the appropriate man page for more details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used -100 for my own critical services. Would you find that a more palatable suggestion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not saying -1000 is never a good idea, but i strongly doubt that it should be the suggested default that people get by uncommenting a single line. the number of services that you'd want using this on any given system would usually be extremely small (probably like <5).

plus, config files with commented out lines are supposed to represent the actual default rather than some other random value. so here it'd be better to use #SSD_OOMSCOREADJ="0" (which i think is the default behavior, but i don't have a way of easily checking), and then leave a comment pointing people to more extensive documentation which would cover valid ranges, as well as the underlying meaning & implications of these settings, as well as some recommended conventions.

this kind of thing is why i suggested sep PR's for sep issues ;).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this kind of thing is why i suggested sep PR's for sep issues ;).

This kind of thing is why I put the bugfix first and the controversial stuff subsequent, so the bugfix could be merged while we continue to hash out the details of the enhancement. ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the default should be the value in the commented-out example config option, but that's not the case for the other options currently. (The default nice level is not -19, and the default I/O nice is not 2:2.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can/should fix the other bad ones ;).

i think there's a few reasonable methods for config files:

  1. the commented value is the default
  2. the inline docs clearly mention that the value is changing behavior from the default
  3. provide both so the default is clear, and can list one or two common alternatives that users would want without having to refer to extensive manual pages

since our configs have largely been doing (1), we should favor consistency here and just switch them all to that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this kind of thing is why i suggested sep PR's for sep issues ;).

This kind of thing is why I put the bugfix first and the controversial stuff subsequent, so the bugfix could be merged while we continue to hash out the details of the enhancement. ;)

this is possible if doing manual git operations from the CLI, but afaik, not possible at all via the web interfaces.

i'm comfortable with manipulating/extracting git objects, but i don't think it's common, and even then, GH doesn't exactly make it an easy process. figuring out the exact remote & ref to manually fetch is tedious.

a sep PR makes it a lot easier to just smash the merge button. basically GH is tailored for flows where a single PR maps to a single "feature", and not tailored for unrelated things. ignoring the "here's a grab bag of fixes that everyone agrees on" type of thing.

my main desktop had hardware failures recently, so i've (temporarily) lost easy access to do any manual surgery from the CLI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. the commented value is the default

since our configs have largely been doing (1), we should favor consistency here and just switch them all to that.

The slight gotcha here is that setting SSD_NICELEVEL="0" will not actually cause start-stop-daemon and supervise-daemon to set the priority of the daemon process to 0. Rather, no change in priority will be applied. The same goes for the new SSD_OOMSCOREADJ variable: setting it to 0 is equivalent to leaving it unspecified: no change to the value will be applied. Granted, the expectation is that start-stop-daemon will be exec'd in a process that already has priority 0 and OOM score adjustment 0, but that's not necessarily the case. I'm not sure that the commented-out suggestions for these variables should have values that are no-ops.

Perhaps I ought to choose an out-of-range value as the sentinel ("unset") value so that explicitly specifying 0 for these options will cause the priority/oom_score_adj value to be affirmatively set to 0. Then the commented-out suggestions in the example config file could all meaningfully have 0 values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I ought to choose an out-of-range value as the sentinel ("unset") value so that explicitly specifying 0 for these options will cause the priority/oom_score_adj value to be affirmatively set to 0. Then the commented-out suggestions in the example config file could all meaningfully have 0 values.

3317c9b

@whitslack
Copy link
Contributor Author

def do not use values <127.

Okay. I'll change it.

0x80 base

Are you certain that isprint(0x80) actually returns false? For that matter, are you certain that isprint(0x1000) actually returns false in all locales?

@whitslack
Copy link
Contributor Author

i would actually look at codegen in -O2/-Os levels

  • With any long-only opts ≥0x1000, the compiler generates an initial range check for the switch to divide the option value space into a low range and a high range. The low range is handled by a jump table.
  • With four or fewer long-only opts starting at 0x1000, the high range is handled by a series of comparisons and conditional branches.
  • With five or more long-only opts starting at 0x1000, the compiler generates a second jump table to handle the high range.
  • With long-only opts starting at 0x1, the compiler generates a single jump table to handle all of the option values.
  • With long-only opts starting at 0x81, the compiler also generates a single jump table to handle all of the option values.

I also don't care to bikeshed. I'm just supporting my argument that making a sparse switch is less code-efficient. I'll implement whatever you say.

@vapier
Copy link
Member

vapier commented Aug 10, 2021

Are you certain that isprint......

i think you're missing some aspects of how locales work in POSIX runtimes. nowhere in openrc do we call setlocale(). that means all our C code only executes in the POSIX locale (i.e. the C locale). hence the only codeset we care about here is ASCII. hence isprint(0x80) and every other value >=0x7f will return false.

that said, when i added the isprint() to the usage code, it was copying a pattern i used in other Gentoo projects when i was using the random non-printable ranges (e.g. 0x01 - 0x1f) due to a misunderstanding of how getopt_long actually worked. i've learned since through experience that it's a bad pattern, and i haven't circled back to updating the openrc code accordingly.

so only use <0x80 values in getopts_long when it's actual printable characters. if we're opting the short option, use >=0x80.

@whitslack
Copy link
Contributor Author

nowhere in openrc do we call setlocale().

That may be true now. I'd argue it's better to avoid the fragility of assuming anything about running in any particular locale, even the POSIX one. You don't want to tie your hands for the future in case you ever want to implement something where you'd want to depend on the user's preferred locale.

so only use <0x80 values in getopts_long when it's actual printable characters. if we're opting the short option, use >=0x80.

Okay, I'll base the long-only option value at 0x80.

@vapier
Copy link
Member

vapier commented Aug 11, 2021

making assumptions about POSIX locale behavior is the entire point of the POSIX locale. it's an exact minimal definition, and anything that deviates from it is horribly broken and deserves to blow up and is not our problem. it'd be like saying you can't make assumptions about ASCII codepoints.

i'm not tied to the use of isprint in the usage code as i alluded earlier. switching to a hardcoded 0x80 test would be fine. which would really be replacing one set of assumptions with another (in this case, encoding ASCII codepoint assumptions at compile time about C compiler behavior).

but i'll also note that writing locale portable code is a fun challenge by itself, and this particular aspect would hardly be the biggest issue. i'll wager that we'll never need to support locale in our tools directly, and even if we wanted to, doing anything other than UTF-8 would be a waste of time, and supporting non-ASCII CLI flags would be a terrible idea.

@@ -790,6 +803,15 @@ int main(int argc, char **argv)
eerrorx("%s: ioprio_set %d %d: %s", applet,
ionicec, ioniced, strerror(errno));

if (oom_score_adj != 0) {
fp = fopen("/proc/self/oom_score_adj", "w");
if (! fp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

omit the space after the !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I change the other occurrence(s) of this at the same time? It's not my style; I was mimicking what's already there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, if you want to adjust that by itself, sounds fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitslack added a commit to whitslack/openrc that referenced this pull request Aug 17, 2021
There are no semantic changes in this commit.

Suggested-by: Mike Frysinger <vapier@gentoo.org>
See: OpenRC#435 (review)
whitslack added a commit to whitslack/openrc that referenced this pull request Aug 17, 2021
…_adj 0

Previously, specifying --nicelevel=0 or --oom-score-adj=0 was
equivalent to not specifying the option. This meant that there was no
way to set the nice level or OOM score adjustment of the launched
daemon process affirmatively to 0. This commit changes the sentinel
values, indicating an unspecified option, from 0 to an out-of-range
value (INT_MIN) so that specifying an option value as 0 will actually
cause the value 0 to be applied to the corresponding process knob.

Additionally, per a suggestion by Mike Frysinger, the suggested values
for the SSD_NICELEVEL, SSD_IONICELEVEL, and SSD_OOMSCOREADJ variables
in the example config file are now given as zeros, which are the
kernel's default values of these process knobs for the init process at
boot. Note that uncommenting any of these zero-valued suggestions will
cause SSD/SD to set the corresponding process knob affirmatively to
zero, whereas leaving the variable unset (and the equivalent command-
line option unspecified) means SSD/SD will not change the corresponding
process knob from its inherited value.

See: OpenRC#435 (comment)
@williamh
Copy link
Contributor

@whitslack Please rebase this on current master and I'll take a look.

whitslack added a commit to whitslack/openrc that referenced this pull request Sep 14, 2021
This commit adds a new --oom-score-adj option to start-stop-daemon and
supervise-daemon, as well as an equivalent SSD_OOMSCOREADJ environment
variable. If either of these are specified (with the command-line
option taking precedence), then the specified adjustment value is
written to /proc/self/oom_score_adj after forking but prior to exec'ing
the daemon (at the time when nice and ionice are applied).

Additionally, per a suggestion by Mike Frysinger, the suggested values
for the SSD_NICELEVEL, SSD_IONICELEVEL, and SSD_OOMSCOREADJ variables
in the example config file are now given as zeros, which are the
kernel's default values of these process knobs for the init process at
boot. Note that uncommenting any of these zero-valued suggestions will
cause SSD/SD to set the corresponding process knob affirmatively to
zero, whereas leaving the variable unset (and the equivalent command-
line option unspecified) means SSD/SD will not change the corresponding
process knob from its inherited value.

See: OpenRC#435 (comment)
whitslack added a commit to whitslack/openrc that referenced this pull request Sep 14, 2021
There are no semantic changes in this commit.

Suggested-by: Mike Frysinger <vapier@gentoo.org>
See: OpenRC#435 (review)
@whitslack
Copy link
Contributor Author

@williamh: Rebased (and squashed) as requested.

@williamh
Copy link
Contributor

@whitslack All of the code style adjustments in this pr make it hard to review. Are you willing to fix the pr to only contain the code for the oom score adjustment implementation?

@whitslack
Copy link
Contributor Author

All of the code style adjustments in this pr make it hard to review. Are you willing to fix the pr to only contain the code for the oom score adjustment implementation?

@williamh: 😅 That's how it was to start, but then @vapier asked for a code style adjustment. I put all the code style changes in their own commit with no other changes, so if you're looking to review only the OOM score changes, they're all in d779c5e.

vapier pushed a commit that referenced this pull request Dec 21, 2021
There are no semantic changes in this commit.

Suggested-by: Mike Frysinger <vapier@gentoo.org>
See: #435 (review)
@vapier
Copy link
Member

vapier commented Dec 21, 2021

i peeled the style change out and merged it by itself. can you rebase please ?

@williamh
Copy link
Contributor

I just did the same with the error message fix; it is merged into master.
I have an issue that no one else would have caught, and I agree it is specific to me, but thoughts would be appreciated.
While reviewing the code, I first thought the name of the environment variable you added was SSD_OOMSTORAGE, but it turns out that my screen reader pronounces SSD_OOMSCOREADJ in a way that sounds extremely similar to SSD_OOMSTORAGE.
So we need a new name for that environment variable.
I'm thinking either SSD_OOM_SCORE_ADJ or SSD_OOM_ADJ. Does anyone have any suggestions or preferences?

@vapier
Copy link
Member

vapier commented Dec 22, 2021

SSD_OOM_SCORE_ADJ is preferable since it matches the /proc/self/oom_score_adj name

@williamh
Copy link
Contributor

@vapier Ok, that's what I thought, I'm working on that locally.

This commit adds a new --oom-score-adj option to start-stop-daemon and
supervise-daemon, as well as an equivalent SSD_OOMSCOREADJ environment
variable. If either of these are specified (with the command-line
option taking precedence), then the specified adjustment value is
written to /proc/self/oom_score_adj after forking but prior to exec'ing
the daemon (at the time when nice and ionice are applied).

Additionally, per a suggestion by Mike Frysinger, the suggested values
for the SSD_NICELEVEL, SSD_IONICELEVEL, and SSD_OOMSCOREADJ variables
in the example config file are now given as zeros, which are the
kernel's default values of these process knobs for the init process at
boot. Note that uncommenting any of these zero-valued suggestions will
cause SSD/SD to set the corresponding process knob affirmatively to
zero, whereas leaving the variable unset (and the equivalent command-
line option unspecified) means SSD/SD will not change the corresponding
process knob from its inherited value.

See: OpenRC#435 (comment)

code style: remove space after unary "not" operator

There are no semantic changes in this commit.

Suggested-by: Mike Frysinger <vapier@gentoo.org>
See: OpenRC#435 (review)
@whitslack
Copy link
Contributor Author

SSD_OOM_SCORE_ADJ is preferable since it matches the /proc/self/oom_score_adj name

If you make that change, then you should also change SSD_NICELEVELSSD_NICE_LEVEL and SSD_IONICELEVELSSD_IO_NICE_LEVEL for consistency. (You'd probably want to preserve as fallbacks the previous names of the existing variables for backward compatibility.)

@vapier
Copy link
Member

vapier commented Dec 22, 2021

to be clear, the nice settings don't have corresponding files whose names would be matched to. but i don't disagree with using _ to break up the words.

@williamh williamh closed this in fd1e4a3 Dec 22, 2021
@whitslack whitslack deleted the oom-score-adj branch January 4, 2022 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants