Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix puml plugin spawning zombie jvm processes #1245

Merged
merged 1 commit into from
Jul 10, 2020

Conversation

ang-zeyu
Copy link
Contributor

@ang-zeyu ang-zeyu commented Jun 13, 2020

What is the purpose of this pull request? (put "X" next to an item, remove the rest)

• [x] Bug fix

Fixes #1243 (partially)
Fixes #1237

What is the rationale for this request?
Fix puml plugin spawning many zombie jvm processes

Proposed commit message: (wrap lines at 72 characters)
Fix puml plugin spawning zombie jvm processes

Asynchronously executing the puml child process results in zombie jvm
processes as there are multiple callbacks attached to it and page
generation is heavily callback based.

Let's wrap the plantUML diagram generation in a setImmediate call,
delaying said generation to after page generation has completed.
This resolves the zombie processes present during page generation,
which may cause persistent freezes on a user's machine, while
preserving the non-blocking behaviour of puml diagram generation.

@ang-zeyu ang-zeyu force-pushed the fix-puml-zombie-process branch 2 times, most recently from cb42470 to a4d7995 Compare June 13, 2020 11:07
@damithc
Copy link
Contributor

damithc commented Jun 19, 2020

@ang-zeyu does this fix work?

@ang-zeyu
Copy link
Contributor Author

@damithc it should be good to go.

A caveat though:

Originally I changed the puml process execution to synchronous but this slows build times tremendously (~20%) since it's blocking.

This 1 line solution fixes it as well and maintains build times, but essentially what it does is postpone all puml diagram generation to the end. Meaning you may still experience a 1-2s freeze at the end (instead of freezing throughout the build).

Not too sure which is better 😮

@damithc
Copy link
Contributor

damithc commented Jun 19, 2020

This 1 line solution fixes it as well and maintains build times, but essentially what it does is postpone all puml diagram generation to the end. Meaning you may still experience a 1-2s freeze at the end (instead of freezing throughout the build).

Have you pushed this 1-line solution? The diff I see is bigger than one line.

@damithc
Copy link
Contributor

damithc commented Jun 19, 2020

BTW, I suspect the original version spawned DxR number of JVMs where D is the number of diagrams and R is the number of times they are reused. 10x3=30. Imagine I add 20 more diagrams; that's going to be 90 JVMs! I wonder if it is possible to limit the number of threads i.e., set a max thread pool size.

@damithc damithc requested a review from alyip98 June 19, 2020 13:02
@damithc
Copy link
Contributor

damithc commented Jun 19, 2020

@alyip98 did you come across this problem when you were implementing this feature?

@ang-zeyu
Copy link
Contributor Author

Have you pushed this 1-line solution? The diff I see is bigger than one line.

yup its the current ^; it's the setImmediate line, the rest are just due to indent changes

BTW, I suspect the original version spawned DxR number of JVMs where D is the number of diagrams and R is the number of times they are reused. 10x3=30. Imagine I add 20 more diagrams; that's going to be 90 JVMs! I wonder if it is possible to limit the number of threads i.e., set a max thread pool size.

Noticed this as well, it happens as we are generating "one png per file location it is included at" instead of "one png per puml definition". Now that you mention it, I'll try including it here as well since it's rather related.

@ang-zeyu
Copy link
Contributor Author

@damithc could you test this out?

It should

  • spin up D jvms at most now
  • all D jvms should only spin up nearing the end of build (around when Pages built! gets logged) and close almost immediately once it's done

I'm not too sure if there was a reason for generating diagrams per use instead of per definition as well though @alyip98

@damithc
Copy link
Contributor

damithc commented Jun 20, 2020

Tried it out. There is a huge improvement. No more than 4 JVMs were active at any time and near-100% CPU usage lasted only 1-2 seconds. The total time seems to have improved too (it's now about 250 seconds, used to take 300-600 depending on the amount of thrashing), possibly due to less thrashing. Nice fix @ang-zeyu 💯

@ang-zeyu
Copy link
Contributor Author

realised this should fix #1237 as well, modified the tests a little to make sure of it.

Copy link
Contributor

@marvinchin marvinchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little hesitant about using the event loop and setImmediate to defer the puml image generation to the end of the build process. Depending on how other callbacks are handled in the poll queue, the behaviour might not be as we expect (please correct me if I'm wrong here, it's been a while since I had to look at the node event loop!).

Do you think we can make deferring the image generation more explicit by moving it to postRender? Hopefully it achieves the same effect, since postRender callbacks are executed after the expensive render step is done (although not at the end of the build process, as the current implementation seems to do).

processedDiagrams.add(outputFilePath);
function generateDiagram(imageOutputPath, content) {
// Avoid generating twice
if (processedDiagrams.has(imageOutputPath)) return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit, but shall we have braces around the body of the if statement for consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little hesitant about using the event loop and setImmediate to defer the puml image generation to the end of the build process. Depending on how other callbacks are handled in the poll queue, the behaviour might not be as we expect (please correct me if I'm wrong here, it's been a while since I had to look at the node event loop!).

I think we should be good here 😬, https://nodejs.org/en/docs/guides/event-loop-timers-and-nexttick/, setImmediate is executed in the check phase, only after all callbacks (ie. all of our page.generate()s) have resolved.

Moving part of it to postRender works a well, but to a lesser extent as you mentioned. It also involves some extra parsing work as we can only move part and not the whole plugin to postRender

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ideal solution would to be find some way to let prioritise the resource cleanup callbacks (im guessing that's how node does it) after the puml processes have finished. There dosen't seem to be an interface for this after hours of searching though, which makes sense since node's event loop wasn't really built for continuous execution of a massive number of callbacks.

An alternative solution, is that if and when we manage to parallelise page generation, we could change the execution back to synchronous. This way we have a good amount of thread usage still 🎉

@ang-zeyu ang-zeyu force-pushed the fix-puml-zombie-process branch 2 times, most recently from e2f153e to 26324d9 Compare July 1, 2020 10:54
@ang-zeyu
Copy link
Contributor Author

ang-zeyu commented Jul 1, 2020

Just a nit, but shall we have braces around the body of the if statement for consistency?

Resolved this, and an import path conflict with the directory structure pr that was causing ci to fail:

// plantuml plugin
const { ensurePosix } = require('@markbind/core/src/utils');

@ang-zeyu
Copy link
Contributor Author

ang-zeyu commented Jul 5, 2020

Gave it some more thought (thanks for the comment on setImmediate!) @marvinchin, realised there was an even better solution all along, which is to split page generation up instead of creating all the page.generate() promises at once. (only the setImmediate is removed, and the page generation code in site.js was reorganized slightly to achieve this)

This should also ensure, as opposed to the previous 2 solutions:

  • no freezes at all (even at the end) (the jvms close up almost as soon as they are done)
  • the progressbar now also 'works' (instead of jumping from 0/777 to 777/777)

spin up D jvms (in total, throughout the course of markbind build) at most now

Realised that this wasn't really true either (notice the amount of jvms popping up and going away in the gif), it only appeared so as the previous solution generated everything at once (almost).
The set which contains the filepaths of processed diagrams (processedDiagrams) is cleared before each run of preRender, which is executed per-page. It's still better than before (generating one per include location) though.

We currently don't have a clean way to maintain 'state' for plugins (e.g. a beforeSiteGenerate hook), so I think it'd be better to delay achieving this till then. (still a better solution imo, but hotfix pushed below #1245 (comment))

progressbar

@damithc
Copy link
Contributor

damithc commented Jul 5, 2020

I tried this version. Definitely better than the master but seems slightly worse than the previous version. Seems to take about 20s longer and more near-100% peaks than before. The progress bar works better though.
You can use this version of the website for testing as that's the one I'm using (it has a few more uml diagrams) https://github.com/nus-cs2103-AY2021S1/website

@damithc
Copy link
Contributor

damithc commented Jul 5, 2020

As I was trying this branch using DOS window (I usually use Git Bash), I noticed this error. I was using lazy preview. The page gets build but this error appears in the log every time the page gets rebuit. Doesn't show up when using git bash though. Just posting here for the record. Will investigate more to try to isolate if you cannot reproduce it easily.
image

@ang-zeyu
Copy link
Contributor Author

ang-zeyu commented Jul 6, 2020

I tried this version. Definitely better than the master but seems slightly worse than the previous version. Seems to take about 20s longer and more near-100% peaks than before. The progress bar works better though.
You can use this version of the website for testing as that's the one I'm using (it has a few more uml diagrams) https://github.com/nus-cs2103-AY2021S1/website

It's rather expected as the set we use to check whether a diagram has been processed has always been cleared at the start of rendering each page 🙈.

I pushed a simple hotfix, that is to basically check whether the output file already exists. Works fine on my machine, could you give it a try?

As I was trying this branch using DOS window (I usually use Git Bash), I noticed this error. I was using lazy preview.

Thanks! Fixed. Was simply an issue with trying to render 0/0 progress bar as it was introduced to page regeneration as well (since in lazy reload mode its possible to generate 0 pages)

@damithc
Copy link
Contributor

damithc commented Jul 6, 2020

I pushed a simple hotfix, that is to basically check whether the output file already exists. Works fine on my machine, could you give it a try?

Much better. Only one CPU peak (very brief) and finished in 260s (previous version took 350!). A very noticeable improvement.

@ang-zeyu
Copy link
Contributor Author

reverted it, realised using a file check breaks live reload unfortunately, since files from previous runs aren't deleted.
Will fix in full in #1228, once I get a proper plugin hook out for this.

@marvinchin let me know if there are any other issues here!

@damithc
Copy link
Contributor

damithc commented Jul 10, 2020

reverted it, realised using a file check breaks live reload unfortunately, since files from previous runs aren't deleted.
Will fix in full in #1228, once I get a proper plugin hook out for this.

Just now I was wondering why the diagram is not getting updated in live preview :-)

@ang-zeyu
Copy link
Contributor Author

🙃

Copy link
Contributor

@marvinchin marvinchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me! Suggested adding a comment to make it easier to understand how the throttling works.

reject(new Error(`Error while generating ${page.sourcePath}`));
}));

pageGenerationQueue.splice(0, MAX_CONCURRENT_PAGE_GENERATION_PROMISES)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this comment would make how we handle the max concurrency clearer?

// Take the first MAX_CONCURRENT_PAGE_GENERATION_PROMISES callbacks and execute them. When a page generation callback completes, it pops the next unprocessed callback off pageGenerationQueue and executes it.

Asynchronously executing the puml child process results in zombie jvm
processes as there are multiple callbacks attached to it and page
generation is heavily callback based.

Let's break page generation into chunks, which allows node to clean up
the zombie processes in a timely fashion.

Let's also change the generated image's path to resolve from the
location of it's definition (ie. the file where the <puml> tag is
used), to reduce the amount of jvms spawned through including the same
file multiple times.
@ang-zeyu ang-zeyu added this to the v2.15.0 milestone Jul 10, 2020
@ang-zeyu
Copy link
Contributor Author

updated and squashed, thanks again for looking at this @marvinchin! 🙂

@ang-zeyu ang-zeyu merged commit ad4e90c into MarkBind:master Jul 10, 2020
@tlylt
Copy link
Contributor

tlylt commented Apr 18, 2022

Hi @ang-zeyu, I am working on fixing something related to plantUML checking errors before image generation (to prevent generating images that basically contain the error message).

I realized that I could possibly spin up only one JVM process after collecting all diagram codes and then pipe it through that java process to generate the images before the site build is done (might need to handle live preview differently), wonder if you still have the context on this PR and if this approach sounds promising? I suspect if this works then it will reduce the overhead of spinning up and shutting down JVM.

@ang-zeyu
Copy link
Contributor Author

Sounds interesting! @tlylt

might need to handle live preview differently

If I still remember the implementation details this should be fine (if spinning up the JVM at the right place).

Just note that we might prefer to keep the implementation strictly isolated as a plugin, but this wouldn't be possible right now with what's supported here. https://markbind.org/devdocs/devGuide/writingPlugins.html
(might need to add something like beforeSiteGenerateComplete)


context on this PR

(should be irrelevant to your solution though)

We were previously throwing N page generation callbacks into the event loop. This isn't an issue most of the time but for a large site like the 2103 one (700+ pages), the many JVMs spawned midway don't get to close due to way exec + the event loop works.

We've since then "staggered" the page generation callbacks (throw a fixed K into the loop at a time until all N pages are done), so the JVMs can close even midway.

@ang-zeyu
Copy link
Contributor Author

@tlylt I think one issue might be the memory cost (which was the main issue here with zombie processes) of generating all images at once with one jvm.
You could try launching the jvm on a whole bunch (50+?) of images first to see how puml deals with this / whether its feasible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants