Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add keep-cache option to "unbound-control reload" to keep caches #569

Merged
merged 2 commits into from
Dec 14, 2022

Conversation

jinmeiib
Copy link
Contributor

This PR implements a new option +keep-cache to unbount-control. It allows unbound to keep the RRset and message caches on reloading the configuration.

The main idea is to move the ownership of the per-worker alloc_caches from workers to "daemon" and keep using them when it's deemed to be feasible. In my understanding this would (almost) eliminate the first reason for clearing the caches as commented in daemon.c:daemon_cleanup:

	/* clean up caches because
	 * a) RRset IDs will be recycled after a reload, causing collisions
	 * b) validation config can change, thus rrset, msg, keycache clear */

The second reason is still an issue. And, there are other config changes that would make the reuse of caches impossible or infeasible:

  • changing the number of threads
  • changing size of rrset or message cache (then the corresponding cache is re-created and old cache entries are invalidated anyway)
  • changing the forwarder address
  • probably more

This PR handles the first two cases explicitly (see the change and comment in daemon_apply_cfg), but leaves the consideration on other cases as the operator's responsibility. That's why keeping the cache is optional and non-default.

I have an automated test that confirms the change, but its style is different from other unbound tests, so it's not included in this PR. I'm also aware that some documentation update will be necessary if (the idea of) this PR is adopted, but I'd first like to get feedback on the concept and the current code that implements it.

I ran both make test and make longtest to confirm the change of this PR won't introduce regression. In the latter I saw occaisional failure of tcp_req_order and ssl_req_order, but they also fail in the master branch, so I don't think the failure is due to this PR.

I'm looking forward to any feedback.

@UladzimirTrehubenka
Copy link

@wcawijngaards could you take care?

@gthess gthess self-assigned this Jun 8, 2022
@gthess
Copy link
Member

gthess commented Jun 8, 2022

This is in my todo list but because of more pressing stuff it got jostled away. I will have a look this/next week.

@penguinpee
Copy link

penguinpee commented Jun 27, 2022

I stumbled on this PR by accident. But I very much support this option. In my setup I tried to solve this by using systemd jobs tied to unbound.service that do dump_cache, load_cache respectively. I'd be willing to help with testing and/or documentation, if needed.

Copy link
Member

@gthess gthess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jinmeiib!

Apart from the one bug I see, the code looks good.
For testing you could update testdata/09-unbound-control.tdir with the keep-cache test cases.

From a first look I am not 100% set on the +keep-cache extra option for the reload command since it deviates from all the other options.
Maybe a separate reload_keep_cache command to make it more explicit that caution is needed and is not just a feature of the reload command (that should always work)?

This feature is more efficient than dump-and-load, but dump-and-load bypasses the technical difficulties of the threads' number and cache sizes.

Btw, when is such a feature (also the dump-and-load variant) useful?

daemon/daemon.c Outdated
@@ -745,7 +789,12 @@ daemon_cleanup(struct daemon* daemon)
free(daemon->workers);
daemon->workers = NULL;
daemon->num = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should move after daemon_clear_allocs otherwise memory leaks happen.

Suggested change
daemon->num = 0;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's right. Thank you for catching it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed it at commit fccb2eb.

Comment on lines +796 to +797
if(!daemon->reuse_cache || daemon->need_to_exit)
daemon_clear_allocs(daemon);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing the memory leak

Suggested change
if(!daemon->reuse_cache || daemon->need_to_exit)
daemon_clear_allocs(daemon);
if(!daemon->reuse_cache || daemon->need_to_exit)
daemon_clear_allocs(daemon);
daemon->num = 0;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gthess thank you for the review!

From a first look I am not 100% set on the +keep-cache extra option for the reload command since it deviates from all the other options. Maybe a separate reload_keep_cache command to make it more explicit that caution is needed and is not just a feature of the reload command (that should always work)?

I don't have a strong opinion about it. Please feel free to rename it (or do you want me to update this PR with the change?).

This feature is more efficient than dump-and-load, but dump-and-load bypasses the technical difficulties of the threads' number and cache sizes.

Btw, when is such a feature (also the dump-and-load variant) useful?

I'd leave it to @UladzimirTrehubenka but I guess their operation requires occasional config changes (that don't affect the reuse of the cache) and they don't want to avoid the initial higher latency/lower throughput while populating the cache when the cache isn't reused. I also guess they want to prefer "keep cache" (over dump-and-load) to minimize the down time on reload.

@UladzimirTrehubenka
Copy link

UladzimirTrehubenka commented Oct 6, 2022

Almost year for code review =(

@gthess
Copy link
Member

gthess commented Oct 14, 2022

I would like to have included this on the 1.17.0 release, but time was an issue :(
The PR for me is ready except for renaming the command from reload +keep-cache to reload_keep_cache, and introducing some test cases.
@jinmeiib, how much do you want to be involved with these? I could take over but I would appreciate if you could share your current test cases and I could try incorporating them in Unbound's testing directories.

@jinmeiib
Copy link
Contributor Author

@gthess I'm attaching the example test tpkg to this comment.
infoblox_unbound_reload.tar.gz

@gthess
Copy link
Member

gthess commented Nov 28, 2022

Perfect, thanks!

@gthess gthess added this to the 1.17.1 milestone Nov 29, 2022
Copy link
Member

@gthess gthess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, will make the change for an explicit reload_keep_cache command and add the testcases.

gthess added a commit that referenced this pull request Dec 14, 2022
  'unbound-control reload' to keep caches.
@gthess gthess merged commit 857d6ce into NLnetLabs:master Dec 14, 2022
jedisct1 added a commit to jedisct1/unbound that referenced this pull request Dec 16, 2022
* nlnet/master:
  - Use an explicit 'reload_keep_cache' command and introduce test cases   for NLnetLabs#569.
  prevent memory leak in case cache isn't reused
  add keep-cache option to unbound-control reload to keep caches
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants