Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fasterer liveness check #688

Merged
merged 2 commits into from Feb 24, 2021
Merged

Conversation

jrafanie
Copy link
Member

@jrafanie jrafanie commented Feb 24, 2021

Replace heartbeat_check.rb and friends with a simple shell script that assumes the heartbeat file now contains a timeout in unix time seconds since epoch.

Depends on the core change to write these files in unix time.

Preliminary results:

BEFORE

kasparov-1 environment running mostly idle

$ oc adm top pods
NAME                                 CPU(cores)   MEMORY(bytes)
1-event-handler-5dd956d75f-fr25h     74m          227Mi
1-generic-5865997d56-2xzrt           54m          322Mi
1-generic-5865997d56-fzwvz           72m          348Mi
1-priority-576657869b-9r4qf          71m          317Mi
1-priority-576657869b-swgdp          61m          297Mi
1-remote-console-5dff596449-ll2j5    35m          199Mi
1-reporting-755b878d7b-jpbzd         59m          189Mi
1-reporting-755b878d7b-vg7c6         59m          229Mi
1-schedule-dd5945764-vvfpb           78m          278Mi
1-ui-7654578488-t9s85                49m          361Mi
1-web-service-cdfc455b7-gbq5x        63m          336Mi
httpd-b89bbdff6-js9qv                1m           20Mi
manageiq-operator-769b4b48bb-dljm9   0m           30Mi
memcached-84987cbdf5-g42zs           0m           8Mi
orchestrator-fc86cbf66-vjzrj         21m          339Mi
postgresql-867fcb7d8-b65xn           9m           274Mi

AFTER

Master, mostly idle. Compare the CPU column with the same column above:

$ oc adm top pods
NAME                                 CPU(cores)   MEMORY(bytes)
1-event-handler-6fb6ff5995-drzw4     3m           177Mi
1-generic-7c87dc84f8-6xgxk           4m           228Mi
1-generic-7c87dc84f8-skp5p           3m           212Mi
1-priority-5df7c8b7f4-hl8jk          12m          228Mi
1-priority-5df7c8b7f4-rgfcl          12m          214Mi
1-remote-console-6f478b75b6-gxh7m    1m           186Mi
1-reporting-79799898-ct72b           5m           177Mi
1-reporting-79799898-knrmb           9m           177Mi
1-schedule-5d569cb977-qtrbr          2m           215Mi
1-ui-74f785fb99-w7q5j                2m           276Mi
1-web-service-5d47b5b5c8-tvjfv       8m           263Mi
httpd-b89bbdff6-tpp5r                0m           14Mi
manageiq-operator-57b78b6c6b-5jrjx   0m           23Mi
memcached-84987cbdf5-qxpjb           0m           1Mi
orchestrator-7f96f4964d-c8vng        16m          345Mi
postgresql-867fcb7d8-sz247           9m           196Mi

Below is a screenshot of a graph created by Thad Jennings and shows the vast difference of the idle CPU usage per worker/pod for appliances vs. podified before this change:

Screen Shot 2021-03-01 at 5 48 17 PM

hb_file="${APP_ROOT}/tmp/worker.hb"
if [[ -f ${hb_file} ]]; then
current_time=$(date +%s)
custom_timeout=$(cat ${hb_file})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why call it custom_timeout instead of heartbeat_time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's sort of neither. The value inside is the custom threshold (specifically, Time.now.to_i + heartbeat_threshold)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just call it threshold?

Copy link
Member Author

@jrafanie jrafanie Feb 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. It's because we store the current time + (the worker's timeout or the default timeout of 2 minutes). We're storing the "custom timeout", not the actual last heartbeat. It's also not the heartbeat timeout as that's a constant 2 minutes. We have a custom value that could be the current time + that 2 minutes or a timeout from that worker's settings.

Of note, we can't even see the worker heartbeat files from the "server" process in podified unless they're on the same filesystem, which isn't the case in pods. The server assumes the worker's are alive or dead based on the liveness check. If the worker row goes away, because it was killed due to not heartbeating, we won't update the worker's last heartbeat. If it does exist, we assume it's heartbeating and set the last heartbeat to the current time. See here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just call it threshold?

I'm not really a fan of threshold since that's also used for memory_threshold and it's not really conveying a meaning. I think it's very much a custom timeout or maybe just a timeout. We overloaded terms with heartbeat timeout so we can't really use that. I'm open to other ideas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the part that is confusing is comparing current_time to a timeout. The former feels like a particular time and the latter like a length of time (i.e. 5 minutes)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have agreed that not_responsive_threshold sounds better for the purposes of what the liveness check is doing.

@miq-bot
Copy link
Member

miq-bot commented Feb 24, 2021

Checked commits jrafanie/manageiq-pods@a5e10c6~...c1f5c0d with ruby 2.6.3, rubocop 0.82.0, haml-lint 0.35.0, and yamllint
0 files checked, 0 offenses detected
Everything looks fine. 🍪

@chessbyte
Copy link
Member

Preliminary results

@jrafanie very impressive performance improvements

@Fryguy
Copy link
Member

Fryguy commented Feb 24, 2021

@jrafanie Do you have a PR that rm's all of the original heartbeat check code or is that going to a follow up?

@Fryguy
Copy link
Member

Fryguy commented Feb 24, 2021

@jrafanie Super minor, but can we rename to not have manageiq in the name? NVM...I thought this was a new file as opposed to replacing the guts of the file.

@jrafanie
Copy link
Member Author

@jrafanie Do you have a PR that rm's all of the original heartbeat check code or is that going to a follow up?

Yes, will do a followup.

@jrafanie jrafanie marked this pull request as ready for review February 24, 2021 20:50
@jrafanie
Copy link
Member Author

Preliminary results

@jrafanie very impressive performance improvements

Thanks for the suggestion...this is a clear case where this needs to be fast and should be hyper optimized.

Copy link
Member

@bdunne bdunne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻 LGTM

@Fryguy Fryguy merged commit 5609c81 into ManageIQ:master Feb 24, 2021
jrafanie added a commit to jrafanie/manageiq that referenced this pull request Feb 24, 2021
The only caller to hearbeat_check.rb was removed in:
ManageIQ/manageiq-pods#688

We switched to a simple bash script that relied on a core change
to make this less complicated:
ManageIQ#21079
@jrafanie jrafanie deleted the fasterer_liveness_check branch February 24, 2021 22:48
@Fryguy Fryguy self-assigned this Mar 4, 2021
@Fryguy
Copy link
Member

Fryguy commented Mar 5, 2021

Backported to lasker in commit 1c08ae7.

commit 1c08ae764e91a86549f1bdaf5cfe8486eb88f020
Author: Jason Frey <fryguy9@gmail.com>
Date:   Wed Feb 24 17:28:37 2021 -0500

    Merge pull request #688 from jrafanie/fasterer_liveness_check
    
    Fasterer liveness check
    
    (cherry picked from commit 5609c81c53d8a4264b37f6d2d38f5e7e0de2d436)

Fryguy added a commit that referenced this pull request Mar 5, 2021
Fasterer liveness check

(cherry picked from commit 5609c81)
jrafanie added a commit to jrafanie/manageiq that referenced this pull request Mar 10, 2021
The liveness probe used to be in ruby and did more than it needed:
loaded some of the ruby / activesupport libraries it didn't really need, parsed
time needlessly, parsed options that were always the same, etc.

It was rewritten below to be more performant so we can now request less cpu.

ManageIQ#21079
ManageIQ/manageiq-pods#688
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants