OutOfMemoryError caused by creation of too many N5ImageLoader fetcher threads #4
Comments
Thanks so much Eric! We should clearly differentiate between fetching metadata (as I do in getTransformedBoundingBox) and actually reading the data, and you're right, we should be able to set the number of threads within the API. We should have a meeting with @tpietzsch to incorporate the necessary changes next year! Happy new year :)
@StephanPreibisch So I confirmed that the error is solved with the workaround. I can also see that it scales with the number of points selected for the non-rigid: in the second volume (same size) there are ~5x the points and it runs ~5x slower (took 15h). Thanks!!!
Hi all, We've tried running the current main branch both as a local Spark instance and as a master-worker Spark instance on our 32-core / 1 TB RAM server. We are trying to affine-fuse an image that will have ~600,000 blocks in the final N5. There are only translation transformations in the XML. No matter how much RAM we allocate to the local instance or to the driver/executors, the execution eventually hits the memory error noted in this thread. We are passing in the JVM flag proposed in this issue. We are quite excited to get this working, but unless there are further suggestions we will wait a bit until the code updates are made. Thanks!
Hi Doug, You may be encountering a different issue - it's hard to know without looking at more details. Helpful starting details would be the explicit exception traceback you get, the specific driver and worker launch commands you used, and basic details about your running environment (Spark version, RAM and cores for driver and workers, local vs. stand-alone cluster, ...). Finally, this specific fetcher-thread issue is still a work in progress, since we are waiting for @tpietzsch's return from holidays to properly fix it. Best,
Yes, let's set up a meeting! A related problem/solution is that ideally multiple ImgLoaders should be able to share a FetcherThread pool and the associated queue (of blocks to load). This is something I wanted to look into anyway, and we can solve this at the same time.
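The sharing idea, as a minimal sketch (a hypothetical class, not the actual bigdataviewer-core FetcherThread implementation): all loaders feed one queue that is served by a fixed number of daemon threads, so the thread count no longer scales with the number of loader instances.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SharedFetcherPool
{
    private final BlockingQueue< Runnable > queue = new LinkedBlockingQueue<>();

    public SharedFetcherPool( final int numFetcherThreads )
    {
        for ( int i = 0; i < numFetcherThreads; i++ )
        {
            final Thread fetcher = new Thread( () -> {
                try
                {
                    while ( true )
                        queue.take().run(); // load the next requested block
                }
                catch ( final InterruptedException e )
                {
                    Thread.currentThread().interrupt();
                }
            }, "fetcher-" + i );
            fetcher.setDaemon( true );
            fetcher.start();
        }
    }

    // Called by any ImgLoader that needs a block loaded asynchronously.
    public void enqueue( final Runnable loadBlock )
    {
        queue.add( loadBlock );
    }
}
```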
Did you check that this actually reduces the number of created threads? |
Good point @tpietzsch - I think I got lost while tracing instances and did not realize the imgLoader was held/cached in the dataLocal object. That would explain why my commit did not solve the problem and I needed to resort to the active processor count workaround. Let's talk about whether rolling back my unhelpful commit makes sense when we cover other related changes at our meeting today.
With bigdataviewer/bigdataviewer-core#130 it is now possible to specify the number of fetcher threads via `setNumFetcherThreads(...)`.
For Spark, because it does not involve visualization, I would just use 0 threads.
Ah... actually, you're using SpimData2. So for that, you would do it like this:

```java
final XmlIoSpimData2 io = new XmlIoSpimData2( "" );
final SpimData2 data = io.load( xmlPath );
final BasicImgLoader imgLoader = data.getSequenceDescription().getImgLoader();
if ( imgLoader instanceof ViewerImgLoader )
    ( ( ViewerImgLoader ) imgLoader ).setNumFetcherThreads( NUMBER_OF_THREADS );
```
Hi @tpietzsch, I finally got time to test your changes tonight (using NUMBER_OF_THREADS=0). After running @boazmohar 's small and big test cases, I did not see any OOM exceptions so I think your fix worked. Please merge and deploy your fix at your earliest convenience (along with the spim_data thread safety change). Once the updated bigdataviewer-core and spim_data packages are deployed, I'll update BigStitcher-Spark to use them. Thanks! |
While working through issue 2 with a larger data set, @boazmohar discovered many

```
OutOfMemoryError: unable to create new native thread
```

exceptions in the worker logs. These exceptions are raised because parallelized RDDs create many N5ImageLoader instances (like this one) and each N5ImageLoader instance in turn creates `Runtime.getRuntime().availableProcessors()` fetcher threads.

I reduced some of the fetcher thread creation by reusing loaders in this commit. However, reusing loaders did not completely solve the problem.
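To make the failure mode concrete (hypothetical numbers, not measurements from this data set): the total native thread count grows as loader instances times available processors, which quickly exceeds the per-process thread limit.

```java
public class FetcherThreadMath
{
    public static void main( final String[] args )
    {
        // Each N5ImageLoader starts availableProcessors() fetcher threads,
        // so the total grows with the number of loader instances created
        // by the parallelized RDDs.
        final int processors = Runtime.getRuntime().availableProcessors();
        final int loaderInstances = 1000; // hypothetical count on one worker
        System.out.println( loaderInstances * processors + " fetcher threads" );
        // On a 48-core worker that is 48,000 native threads -- far beyond
        // what the OS allows a single JVM, hence
        // "OutOfMemoryError: unable to create new native thread".
    }
}
```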
I think the best solution is to parameterize the number of fetcher threads in the N5ImageLoader and then explicitly set fetcher thread counts in Spark clients. This issue can remain open until that happens or until another long-term solution is developed.
In the meantime, as a workaround, overriding the default availableProcessors value with a

```
-XX:ActiveProcessorCount=1
```

JVM directive seems to fix the problem. More specifically, here are the spark-janelia flintstone.sh environment parameters I used to successfully process @boazmohar's larger data set:
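For reference, the same flag can also be passed through standard Spark configuration properties. This is only a sketch (the app name and structure are hypothetical), not the actual flintstone.sh parameters used above:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LaunchWithOneActiveProcessor
{
    public static void main( final String[] args )
    {
        // Pass the flag to every executor JVM so that
        // Runtime.getRuntime().availableProcessors() reports 1 and each
        // N5ImageLoader creates only a single fetcher thread.
        final SparkConf conf = new SparkConf()
                .setAppName( "AffineFusion" ) // hypothetical app name
                .set( "spark.executor.extraJavaOptions", "-XX:ActiveProcessorCount=1" );
        // The driver-side equivalent (spark.driver.extraJavaOptions) must be
        // given on the spark-submit command line, because the driver JVM is
        // already running by the time this SparkConf is constructed.
        try ( final JavaSparkContext sc = new JavaSparkContext( conf ) )
        {
            // ... run the fusion job ...
        }
    }
}
```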
With the limited active processor count and reused loaders, no OutOfMemoryError exceptions occur and processing completes much faster. @boazmohar noted that with his original setup, it took 3.5 hours using a Spark cluster with 2011 cores. My run with the parameters above took 7 minutes using 2200 cores (on 200 11-core worker nodes). Boaz's original run might have had other configuration issues, so this isn't necessarily an apples-to-apples comparison. Nevertheless, my guess is that his performance was adversely affected by the fetcher thread problem.
Finally, @StephanPreibisch may want to revisit the getTransformedBoundingBox code and any other loading/reading to see if there are other options for reducing/reusing loaded data within the parallelized RDD loops. Broadcast variables might be suitable/helpful for this use case - but I'm not sure.
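For instance, per-view metadata such as the transformed bounding boxes could be computed once on the driver and shared read-only with all executors. A minimal sketch with hypothetical types (`String` block IDs and `long[]` boxes standing in for the real BigStitcher classes; the broadcast value must be serializable):

```java
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastMetadataSketch
{
    // Compute metadata once on the driver and share it read-only with every
    // executor, so tasks in the parallelized RDD loops do not re-load it.
    public static void fuseBlocks( final JavaSparkContext sc,
            final JavaRDD< String > blockIds,
            final Map< String, long[] > boundingBoxes )
    {
        final Broadcast< Map< String, long[] > > bb = sc.broadcast( boundingBoxes );
        blockIds.foreach( id -> {
            final long[] box = bb.value().get( id ); // no per-task re-loading
            // ... fuse this block within box ...
        } );
    }
}
```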