
Fix multiple qmemman issues #331

Merged
merged 1 commit into from Apr 10, 2020

Conversation

marmarek
Member

@marmarek marmarek commented Apr 1, 2020

First, the main bug: when the meminfo xenstore watch fires, in some cases
(just after starting a domain) XS_Watcher refreshes its internal list of
domains before processing the event. This is done specifically to
include the new domain in that list. But the opposite can happen too: the
domain could have been destroyed. In that case the refresh_meminfo()
function raises an exception, which isn't handled and interrupts the
whole xenstore watch loop. This issue is most likely triggered by killing
a domain, since the domain can then disappear shortly after writing an
updated meminfo entry. On a proper shutdown, meminfo-writer is stopped
earlier and does not write updates just before the domain is destroyed.
Fix this by checking whether the requested domain is still there just
after refreshing the list.
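
A rough sketch of that fix: the names below (XS_Watcher, refresh_domain_list, meminfo_changed, domdict) are illustrative stand-ins modeled on the description above, not the exact qmemman code.

```python
class XS_Watcher:
    """Minimal sketch of the race and its fix; not the real daemon."""

    def __init__(self):
        self.domdict = {}  # domid -> per-domain state

    def refresh_domain_list(self):
        # In the real daemon this re-reads the list of domains from Xen;
        # here it is a stub so the sketch is self-contained.
        pass

    def meminfo_changed(self, domid, untrusted_meminfo):
        # Refresh first, so a freshly started domain is included...
        self.refresh_domain_list()
        # ...but the opposite can happen too: the domain may have been
        # destroyed (e.g. killed) just after writing its meminfo entry.
        # The fix: bail out if it is gone, instead of letting a lookup
        # raise and kill the whole watch loop.
        if domid not in self.domdict:
            return
        self.domdict[domid].meminfo = untrusted_meminfo
```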

Then, catch exceptions in the xenstore watch handling functions, so they
do not interrupt the xenstore watch loop. If that loop gets interrupted,
qmemman essentially stops balancing memory.
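
One way this hardening could be sketched is as a wrapper around each watch handler; the decorator name and structure here are assumptions for illustration, not the actual change.

```python
import logging

log = logging.getLogger('qmemman')


def safe_handler(handler):
    """Wrap a xenstore watch handler so one failing event cannot
    kill the whole watch loop (illustrative sketch only)."""
    def wrapper(*args, **kwargs):
        try:
            handler(*args, **kwargs)
        except Exception:
            # Log and keep going: an interrupted watch loop would stop
            # all memory balancing, which is worse than one lost event.
            log.exception('Error handling xenstore watch event')
    return wrapper
```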

And finally, clear the force_refresh_domain_list flag after refreshing
the domain list. That missing line caused a full domain-list refresh on
every meminfo change, wasting some CPU time.
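
A minimal illustration of that one-line fix, with hypothetical names (WatchLoop, on_meminfo_event) standing in for the real code:

```python
class WatchLoop:
    """Illustrative only: shows clearing the flag after acting on it,
    so the domain-list refresh does not run on every meminfo change."""

    def __init__(self):
        self.force_refresh_domain_list = False
        self.refresh_count = 0

    def refresh_domain_list(self):
        self.refresh_count += 1

    def on_meminfo_event(self):
        if self.force_refresh_domain_list:
            self.refresh_domain_list()
            # The missing line: without it, every subsequent meminfo
            # event triggered another full refresh.
            self.force_refresh_domain_list = False
```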

While at it, change the "EOF" log message to something a bit more
meaningful.

Thanks @conorsch for capturing valuable logs.

Fixes QubesOS/qubes-issues#4890
@codecov

codecov bot commented Apr 1, 2020

Codecov Report

Merging #331 into master will decrease coverage by 0.07%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master     #331      +/-   ##
==========================================
- Coverage   63.58%   63.51%   -0.08%     
==========================================
  Files          50       50              
  Lines        8978     8988      +10     
==========================================
  Hits         5709     5709              
- Misses       3269     3279      +10     
Flag        Coverage Δ
#unittests  63.51% <0.00%> (-0.08%) ⬇️

Impacted Files            Coverage Δ
qubes/tools/qmemmand.py   0.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8f0ec59...dd50e30.

@conorsch

conorsch commented Apr 2, 2020

Looks great! Have patched locally, will keep an eye on behavior and report back.

@qubesos-bot

qubesos-bot commented Apr 4, 2020

OpenQA test summary

Complete test suite and dependencies: https://openqa.qubes-os.org/tests/7491#dependencies

Failed tests

New failures

Compared to: https://openqa.qubes-os.org/tests/6362#dependencies

Fixed failures

Compared to: https://openqa.qubes-os.org/tests/6362#dependencies

@conorsch

conorsch commented Apr 7, 2020

For what it's worth: I've been running this patch locally for nearly a week, and I've yet to observe a single failure of the qubes-memman service. I've been running watch xl list to observe the constant rebalancing. Hardly a definitive confirmation of the fix, but it's already a dramatic improvement over the behavior I was seeing before the patch!

So, 👍 on merge from me!
