
fix: shutdown controller after tests #5710

Merged
merged 3 commits into from
Jul 2, 2022

Conversation

@mhofman (Member) commented Jul 1, 2022

closes: #5447

Description

After investigating #5447, I realized that the culprit was the xsnap child process holding onto the kernel bundles through an obscure chain of retainers.

Security Considerations

None

Documentation Considerations

Any new test that creates a controller should add a teardown hook to shut down the controller.
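The recommended pattern can be sketched as follows. Everything here is a minimal stand-in, not the real API: `buildVatController` mocks the SwingSet factory (the real one returns a promise), and `t.teardown`/`runTeardowns` mimic AVA's teardown-hook lifecycle, only to illustrate that the hook reaps the controller after the test body finishes:

```javascript
// Mock of AVA's teardown registry: hooks run after the test completes.
const teardowns = [];
const t = { teardown: fn => teardowns.push(fn) };

// Mock of SwingSet's buildVatController; the real controller owns the
// xsnap child processes that were leaking before this PR.
function buildVatController() {
  let running = true;
  return {
    get running() { return running; },
    shutdown: async () => { running = false; },
  };
}

const controller = buildVatController();
// The fix from this PR: register shutdown as a teardown hook.
t.teardown(() => controller.shutdown());

// AVA would invoke the registered hooks once the test body is done.
function runTeardowns() {
  return Promise.all(teardowns.map(fn => fn()));
}
```

With this hook in place, the controller (and the child processes and bundles it retains) is released as soon as each test ends, instead of lingering until the whole AVA process exits.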

Testing Considerations

Ran the previously exploding terminate test locally with V8 heap snapshots, and confirmed no leaks were occurring.

@turadg (Member) left a comment


run-protocol changes LGTM

@mhofman (Member, Author) commented Jul 1, 2022

I don't know if this change is related, but it looks like the SwingSet tests now only take 2 to 3 minutes!

@mhofman force-pushed the mhofman/5447-fix-controller-shutdown-leak branch from d64fe38 to 499cda0 on July 1, 2022 at 20:27
@warner (Member) left a comment


SwingSet changes look good. One question about the E() in some non-SwingSet tests, though.

```diff
@@ -71,6 +71,7 @@ test.before(async t => {
 async function main(t, argv) {
   const { kernelBundles, config } = t.context.data;
   const controller = buildVatController(config, argv, { kernelBundles });
+  t.teardown(E(controller).shutdown);
```
Member:

Why the E(controller)?

@mhofman (Member, Author):

That test setup used that form and doesn't await the result of buildVatController, so I kept it intact 🤷‍♂️

Member:

Ah, ok. I happened to ask MarkM about that pattern today, in some unrelated code, and he said it was a wart that E() allowed the proxy-trap-returned function to be observed that way (E.get would not, and the tildot syntax would probably treat `x = target~.method; x()` quite differently than `x = target~.method()`), so he discouraged me from considering using it.

If controller doesn't last more than a few lines, I'd be willing to rewrite the code to do `const controller = await ...`. But yeah, I'm also ok with matching the existing pattern and calling it a day.
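The suggested rewrite can be sketched as follows. This is a mocked illustration, not the real code: `buildVatController` here is a stand-in for the SwingSet factory (which does return a promise), and `t`/`teardowns` mimic AVA. The point is that once the controller is awaited, the teardown hook can be a plain arrow function instead of the function returned by `E(controller).shutdown`:

```javascript
// Mock AVA teardown registry.
const teardowns = [];
const t = { teardown: fn => teardowns.push(fn) };

// Mock of the async buildVatController, counting shutdown calls so we
// can observe that the hook really fires.
let shutdownCalls = 0;
const buildVatController = async () => ({
  shutdown: async () => { shutdownCalls += 1; },
});

async function main(t) {
  // Awaiting removes the need for the E() wrapper entirely.
  const controller = await buildVatController();
  t.teardown(() => controller.shutdown());
  return controller;
}
```

This sidesteps the wart MarkM described, since no proxy-trap-returned function is ever detached from its E() call site.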

```diff
@@ -77,6 +77,7 @@ test.before(async t => {
 async function main(t, argv) {
   const { kernelBundles, config } = t.context.data;
   const controller = buildVatController(config, argv, { kernelBundles });
+  t.teardown(E(controller).shutdown);
```
Member:

same

```diff
@@ -114,7 +114,7 @@ export function xsnap(options) {
     stdio: ['ignore', stdout, stderr, 'pipe', 'pipe'],
   });

-  xsnapProcess.on('exit', (code, signal) => {
+  xsnapProcess.once('exit', (code, signal) => {
```
Member:

is this just hygiene, or is it possible for processes to .on('exit') multiple times?

@mhofman (Member, Author):

It's not possible for 'exit' to be triggered multiple times (though the base EventEmitter logic doesn't know that), and hygiene dictates that listeners should be removed once they are no longer needed.

@mhofman mhofman added the automerge:rebase Automatically rebase updates, then merge label Jul 1, 2022
@warner (Member) commented Jul 1, 2022

Wow, good catch. I remember adding explicit c.shutdown()s to the replay tests that explicitly exercise successive launches of the same state, and to tests that deliberately launch xsnap workers, but I didn't think about the environment variables that can cause the entire test suite to use xsnap. I have a vague memory that the lack of .shutdown made the parent process hang (because Node is waiting for all sockets to close before allowing the process to exit?). Did it look like none of the AVA processes were ever exiting, or something?

Did the memory being held seem appropriate for a live controller to be holding? I know there's some amount of memory needed just for the kernel to do its thing, and having a zillion of them in a single AVA process could be a problem, but it sounded like you were seeing old messages being retained too? That would sound like a different problem that might still be an issue.

@mhofman force-pushed the mhofman/5447-fix-controller-shutdown-leak branch from 499cda0 to 26766f6 on July 1, 2022 at 23:58
@mhofman (Member, Author) commented Jul 2, 2022

xsnap processes were lingering around, and through the pipes to them the whole kernel stayed in memory, including the bundles. After adding a shutdown, it seems like almost nothing was surviving from one test to another.

Most of the heap data was the bundles, but there were other, shorter retaining paths to them, which I had to clean up (the stackString logic in SES in particular) before I realized the real retaining path was the xsnap childProcess, at which point I wondered why that was still alive and realized we needed to explicitly shut down the controllers.

@warner (Member) commented Jul 2, 2022

I bet I believe(d) that AVA runs each test file in a separate process, and once all those test cases have finished, the process exits. I bet I also believe(d) that when the process exits, all the child xsnap processes would exit. But 1: I haven't actually tested either belief (I should add some console.log(getpid())s and see if they're distinct), and 2: I suspect AVA is doing something clever with threads that winds up running more than one test file in a single process (based on some GC weirdness we keep seeing), and 3: I can easily imagine that turning into a whole lot of dead kernels being kept in RAM until the larger/aggregate process exits.

If AVA were really enthusiastic about threads, maybe every test it runs would contribute a dead kernel and the VM footprint would grow hugely before finally exiting for real. But only in CI, where we use the environment variable to switch on XS for all tests.

@mergify mergify bot merged commit f784f83 into master Jul 2, 2022
@mergify mergify bot deleted the mhofman/5447-fix-controller-shutdown-leak branch July 2, 2022 00:46
Labels

automerge:rebase Automatically rebase updates, then merge

Successfully merging this pull request may close these issues: Kernel memory leak causing test SIGKILL

3 participants