Memory error when trying to get metadata from huge file #8

Closed
jni opened this Issue May 2, 2014 · 16 comments


jni commented May 2, 2014

Hi,

I have a 383GB Leica .lif file, which obviously I would like to process sequentially. (It contains a large number of 1024 x 1024 x 30 x 3-channel images.) I figured I would read the metadata and then use bioformats.load_image with the right parameters to load each image series. (Incidentally, it would be great to allow z=None to load up the entire stack! Happy to submit a PR if that would be welcome and you can point me in the right direction!)

Here's my code snippet:

import javabridge as jv
import bioformats as bf
jv.start_vm(class_path=bf.JARS)
md = bf.get_omexml_metadata('Long time Gfap 260314.lif')

And here's the resulting error:

---------------------------------------------------------------------------
JavaException                             Traceback (most recent call last)
<ipython-input-7-640f92a788d5> in <module>()
----> 1 md = bf.get_omexml_metadata('Long time Gfap 260314.lif')

/Users/nuneziglesiasj/anaconda/lib/python2.7/site-packages/bioformats/formatreader.pyc in get_omexml_metadata(path, url)
    943         xml;
    944         """
--> 945         xml = jutil.run_script(script, dict(path=rdr.path, reader = rdr.rdr))
    946         return xml

/Users/nuneziglesiasj/anaconda/lib/python2.7/site-packages/javabridge/jutil.pyc in run_script(script, bindings_in, bindings_out, class_loader)
    338              "Ljava/lang/Object;)"
    339              "Ljava/lang/Object;",
--> 340              scope, script, "<java-python-bridge>", 0, None)
    341         result = unwrap_javascript(result)
    342         for k in list(bindings_out):

/Users/nuneziglesiasj/anaconda/lib/python2.7/site-packages/javabridge/jutil.pyc in call(o, method_name, sig, *args)
    888     ret_sig = sig[sig.find(')')+1:]
    889     nice_args = get_nice_args(args, args_sig)
--> 890     result = fn(*nice_args)
    891     x = env.exception_occurred()
    892     if x is not None:

/Users/nuneziglesiasj/anaconda/lib/python2.7/site-packages/javabridge/jutil.pyc in fn(*args)
    855             x = env.exception_occurred()
    856             if x is not None:
--> 857                 raise JavaException(x)
    858             return result
    859     else:

JavaException: Java heap space

What's the correct way to use python-bioformats to read in a file one series at a time? Or is such functionality not currently supported?

Thanks!


jni commented May 2, 2014

I should mention: not all series have the same number of z planes, and the number of series is not known a priori, and the series name is important in downstream processing.

Contributor

ljosa commented May 2, 2014

You can use the max_heap_size keyword argument to start_vm to give the JVM more memory; see "Starting and killing the JVM" in the docs.

But you probably don't have 383 GB of RAM … let's ask @LeeKamentsky if it's possible to have Bioformats read the file only in parts.


jni commented May 2, 2014

Your assumption that I have strictly less than 383GB of RAM lying around is correct. ;)


LeeKamentsky commented May 2, 2014

The best strategy for reading stacks is to use bioformats.get_image_reader() to get a reader or to use a construct like the following:

with bioformats.ImageReader(path) as rdr:
    for z in range(rdr.rdr.getSizeZ()):
        plane = rdr.read(z=z)
        do_something_useful_with(plane)

That will cache the parsing information.

As for the OME XML, it's possibly a question for the OME team (http://www.openmicroscopy.org/site/support/bio-formats5/). You can extract the XML using only their tools, like this:

java -cp <path-to-python>\lib\site-packages\bioformats\jars\loci_tools.jar loci.formats.tools.ImageInfo -nopix -omexml <path-to-image-file>

You can increase the memory size on the Java command-line using the -Xmx switch, e.g. -Xmx400G which should be commonplace in about 12 years if Moore's law still holds.
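
If you would rather drive that command from Python, something like the following might work. It is only a sketch: build_imageinfo_command and dump_omexml are hypothetical helper names, the jar and image paths are placeholders you would fill in, and it assumes java is on your PATH. The class name and the -nopix, -omexml and -Xmx switches are the ones mentioned above.

```python
import subprocess


def build_imageinfo_command(jar_path, image_path, max_heap="2G"):
    """Build the argument list for the Bio-Formats ImageInfo tool.

    Hypothetical helper: jar_path should point at loci_tools.jar and
    max_heap is handed to the JVM's -Xmx switch.
    """
    return [
        "java",
        "-Xmx" + max_heap,
        "-cp", jar_path,
        "loci.formats.tools.ImageInfo",
        "-nopix",    # metadata only, do not read pixel data
        "-omexml",   # include the OME-XML in the output
        image_path,
    ]


def dump_omexml(jar_path, image_path, max_heap="2G"):
    """Run ImageInfo and return whatever it prints (needs Java installed)."""
    cmd = build_imageinfo_command(jar_path, image_path, max_heap)
    return subprocess.check_output(cmd)
```

Passing the arguments as a list rather than one shell string means a file name with spaces, such as 'Long time Gfap 260314.lif', needs no extra quoting.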

If you have big planes (e.g. 10K x 10K), you can read tiles - it's just a little clumsy:

with bioformats.ImageReader(path) as rdr:
    format_reader = rdr.rdr # this is the wrapper around the Bio-Formats class
    format_reader.setSeries(series) # choose the correct series in the file
    index = format_reader.getIndex(z, c, t) # get the plane index you want
    img = format_reader.openBytesXYWH(index, xoff, yoff, width, height) # Read a tile
    img.shape = (height, width) # change this to (height, width, 3) if color
    do_something_useful_with(img)
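
To walk over a whole plane tile by tile, the offsets and sizes can be computed up front. A minimal sketch in plain Python; tile_grid is a hypothetical helper, not part of python-bioformats:

```python
def tile_grid(width, height, tile_size):
    """Yield (xoff, yoff, tile_width, tile_height) tuples covering a plane.

    Tiles on the right and bottom edges are clipped to the plane size.
    """
    for yoff in range(0, height, tile_size):
        for xoff in range(0, width, tile_size):
            yield (xoff, yoff,
                   min(tile_size, width - xoff),
                   min(tile_size, height - yoff))


# Example: a 10000 x 10000 plane cut into 4096-pixel tiles gives a 3 x 3 grid.
tiles = list(tile_grid(10000, 10000, 4096))
print(len(tiles))   # 9
print(tiles[-1])    # (8192, 8192, 1808, 1808)
```

Each tuple could then be unpacked into the xoff, yoff, width and height arguments of the openBytesXYWH call shown above.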

You can also refrain from reading the metadata as XML if all you need is the series and stack-size information.

with bioformats.ImageReader(path) as rdr:
    format_reader = rdr.rdr
    print("# series = %d" % format_reader.getSeriesCount())
    for i in range(format_reader.getSeriesCount()):
        format_reader.setSeries(i)  # select series i before querying its sizes
        print("sizeC=%d, sizeZ=%d, sizeT=%d" % (format_reader.getSizeC(), format_reader.getSizeZ(), format_reader.getSizeT()))
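
Since not every series has the same number of z planes, it may help to collect the per-series sizes into a list first. The sketch below relies only on the getSeriesCount, setSeries and getSize* calls used in this thread. series_sizes is a hypothetical helper, and FakeReader is a made-up stand-in so the example runs without a JVM or a .lif file; with a real file you would pass rdr.rdr instead.

```python
def series_sizes(format_reader):
    """Return a list of (series, sizeC, sizeZ, sizeT), one entry per series.

    setSeries(i) is called before the size getters; otherwise they keep
    reporting the sizes of whichever series is currently selected.
    """
    sizes = []
    for i in range(format_reader.getSeriesCount()):
        format_reader.setSeries(i)
        sizes.append((i,
                      format_reader.getSizeC(),
                      format_reader.getSizeZ(),
                      format_reader.getSizeT()))
    return sizes


class FakeReader(object):
    """Hypothetical stand-in for rdr.rdr, for demonstration only."""

    def __init__(self, shapes):
        self._shapes = shapes   # list of (sizeC, sizeZ, sizeT)
        self._current = 0

    def getSeriesCount(self):
        return len(self._shapes)

    def setSeries(self, i):
        self._current = i

    def getSizeC(self):
        return self._shapes[self._current][0]

    def getSizeZ(self):
        return self._shapes[self._current][1]

    def getSizeT(self):
        return self._shapes[self._current][2]


print(series_sizes(FakeReader([(3, 30, 1), (3, 12, 1)])))
# [(0, 3, 30, 1), (1, 3, 12, 1)]
```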

Hope this gets you over the learning-curve hump.


jni commented May 3, 2014

Hi @LeeKamentsky!

It looks like getSeriesCount, setSeries, and getSizeZ will combine to do what I need... I'll report back in a bit. I'll also need the series name.

How come these aren't documented here? (Not being snarky, I'm just surprised since the doc looks auto-generated!)

Regarding the XML, it should be a small amount of data, right? Is python-bioformats reading the full image in order to get the metadata? I'm not sure what I'm supposed to ask of the OME team...?

Finally, what's the logic behind the rdr.rdr syntax? (versus just putting that functionality in the original rdr)

Thanks for the help!

Contributor

ljosa commented May 3, 2014

We tried to start with documenting a small interface that is comprehensible and that we can keep stable rather than everything that happens to be implemented.

The docs are partially autogenerated: the system pulls some things from the code, but we still decide what goes in the docs. Pull requests are welcome. Source code for the docs.

Contributor

ljosa commented May 3, 2014

As for rdr.rdr: the inner rdr wraps either an OMERO reader or a regular Bio-Formats reader, depending on the path. One argument for doing it the way it's done is that the classes in the Python interface map cleanly to classes in the Java interface, so it will be easier to use Bio-Formats docs and source for reference and easier for the Python interface to track the Java interface as it evolves. @LeeKamentsky may have had other reasons as well.

I'm sure there are opportunities for smoothing out the interface (and suggestions are welcome); we just haven't been too eager to do so, both because of the work involved and because we need to be careful not to paint ourselves into a corner. A slightly clunky but stable API is better than a beautiful API that changes every week.


LeeKamentsky commented May 5, 2014

I think that the format reader class should be documented since it's pretty useful - that it's not is mostly an accident caused by the way the class is generated on the fly (I do a lazy load so the Javabridge doesn't throw an exception while loading the bioformats module because the .JAR wasn't on the classpath). I'm also thinking that loading tiles is something that should be supported by keyword arguments to ImageReader.read.

In any case, I'll look at getting the FormatReader docs put into the official documentation.


jni commented May 22, 2014

Hi @LeeKamentsky, @ljosa, I tried the very simplest approach:

filename = 'Long time Gfap 260314.lif' # 400GB
with bf.ImageReader(filename) as rdr:
    reader = rdr.rdr
    print("The file contains %i series." % reader.getSeriesCount())

But I still get a java.lang.OutOfMemoryError! It's a bit hairier, too, because it's hidden by a Glacier2/PermissionDeniedException. Here's the error in the IPython notebook:

Failed to get class Glacier2/PermissionDeniedException
No handlers could be found for logger "bioformats.formatreader"
---------------------------------------------------------------------------
JavaException                             Traceback (most recent call last)
<ipython-input-3-1cb04f462e8d> in <module>()
      1 # Nope: this gets the same exception!
      2 filename = 'Long time Gfap 260314.lif'
----> 3 with bf.ImageReader(filename) as rdr:
      4     reader = rdr.rdr
      5     print("The file contains %i series." % reader.getSeriesCount())

/Users/nuneziglesiasj/anaconda/lib/python2.7/site-packages/bioformats/formatreader.pyc in __init__(self, path, url, perform_init)
    632         self.rdr.o = jrdr
    633         if perform_init:
--> 634             self.init_reader()
    635 
    636     def __enter__(self):

/Users/nuneziglesiasj/anaconda/lib/python2.7/site-packages/bioformats/formatreader.pyc in init_reader(self)
    673             je = e.throwable
    674             if jutil.is_instance_of(
--> 675                 je, "Glacier2/PermissionDeniedException"):
    676                 # Handle at a higher level
    677                 raise

/Users/nuneziglesiasj/anaconda/lib/python2.7/site-packages/javabridge/jutil.pyc in is_instance_of(o, class_name)
    809     jexception = get_env().exception_occurred()
    810     if jexception is not None:
--> 811         raise JavaException(jexception)
    812     result = env.is_instance_of(o, klass)
    813     jexception = get_env().exception_occurred()

JavaException: Glacier2/PermissionDeniedException

And here's the IPython console output:

log4j:WARN No appenders could be found for logger (loci.common.NIOByteBufferProvider).
log4j:WARN Please initialize the log4j system properly.
Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:133)
    at java.lang.StringCoding.decode(StringCoding.java:173)
    at java.lang.StringCoding.decode(StringCoding.java:185)
    at java.lang.String.<init>(String.java:570)
    at java.lang.String.<init>(String.java:593)
    at loci.common.RandomAccessInputStream.readString(RandomAccessInputStream.java:357)
    at loci.formats.in.LIFReader.initFile(LIFReader.java:377)
    at loci.formats.FormatReader.setId(FormatReader.java:1072)
Exception in thread "Thread-2" java.lang.NoClassDefFoundError: Glacier2/PermissionDeniedException
Caused by: java.lang.ClassNotFoundException: Glacier2.PermissionDeniedException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

Even replacing the entire scope with pass fails! So it seems ImageReader is trying to open the full dataset by default...! Any ideas?


LeeKamentsky commented May 22, 2014

I'm looking at where this is running out of memory - it's running out in the Bio-Formats LIFReader - this is the line that corresponds to your stack trace in the latest version:

https://github.com/openmicroscopy/bioformats/blob/develop/components/formats-gpl/src/loci/formats/in/LIFReader.java#L407

If you look at the code, it's going to read in the XML description of your file all in one shot and the code has no choice but to read the whole thing into memory. Looking at how you start the VM, you're running with the default memory allocation which is "only" 256MB. You can start the VM like this:

import javabridge as jv
import bioformats as bf
jv.start_vm(class_path=bf.JARS, max_heap_size="2G")

and it will give you some more headroom for dealing with your large file. It might be good to know the size of that chunk of XML. I don't have an .LIF file, so I can't test this, but I think the following code snippet should print out the size of the XML chunk:

import numpy as np
with open('Long time Gfap 260314.lif', "rb") as fd:
    fd.read(9)  # skip the bytes that precede the length field
    length = np.frombuffer(fd.read(4), "<i4")  # little-endian 32-bit integer
    print(length[0])

I wish I had better news, but if that number is in the 100 MB range, things will be difficult. The XML is copied several times during processing and then a DOM tree is generated for it - the tree will take up much more memory than the string itself. It might make sense to write a script that will extract the individual planes into separate TIF files, even though that will cost you quite a lot of disk space. Otherwise, you'll have to parse the XML each time you open a reader on the file.
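
The same header check can be done with the standard library alone, which avoids needing numpy just to decode four bytes. This is a sketch under the same untested assumptions as the snippet above (a little-endian 32-bit integer at byte offset 9); lif_xml_length is a hypothetical helper name. If the value turns out to count two-byte characters rather than bytes, which has not been verified here, the XML would occupy twice that many bytes on disk.

```python
import struct


def lif_xml_length(path):
    """Return the 32-bit little-endian integer stored at byte offset 9.

    The offset and byte order are assumptions carried over from the
    snippet above; they have not been checked against a real .lif file.
    """
    with open(path, "rb") as fd:
        fd.seek(9)
        raw = fd.read(4)
    if len(raw) != 4:
        raise ValueError("file too short to contain a length field")
    return struct.unpack("<i", raw)[0]
```

Calling lif_xml_length('Long time Gfap 260314.lif') before starting the JVM would give a rough idea of how much heap to ask for.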


jni commented May 22, 2014

@LeeKamentsky That's not bad news at all! It actually had not occurred to me that (a) the default max_heap_size would be so low, and (b) the metadata by itself could actually be blowing through that heap size! Setting it to 8GB solved the problem!

[Screenshot, 2014-05-22 at 11:29:20 pm; image not preserved]


jni commented May 22, 2014

Even my original get_omexml_metadata command works fine with the bigger heap size! =) (feeling rather silly...)


LeeKamentsky commented May 22, 2014

Glad to hear it - 27 MB sounds manageable, hope it's smooth sailing for you going forward.


jni commented May 23, 2014

@LeeKamentsky Sorry, a couple more questions about memory management:

  • I don't understand exactly what's happening with start_vm and kill_vm. Do I understand correctly from the docs that I shouldn't call kill_vm until the very end of my IO operations? (Rather than, say, after reading in each stack.) If my program crashes or closes, does the VM automatically get killed?
  • Also, if I'd rather not use the context manager for ImageReader() (as it's cumbersome for interactive exploration), can I just avoid it by calling the close() function manually when I'm done with it?

LeeKamentsky commented May 23, 2014

The Java VM can only be run once and your program won't exit until it's killed. So you should start the VM early in your program's initialization and kill it only when you're sure you won't need it again. You can't restart the Java VM after killing it - JNI doesn't support that, not my choice. The VM is in-process, so if your program exits (e.g. os._exit(0)), the VM is dead as well.

You can go ahead and call ImageReader.close() - same thing as using the context. Even if you don't, garbage collection will take care of everything except for deleting a temporary file created if you open a URL image.
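
One way to follow the "start early, kill once at the very end" advice is to wrap the whole job in a single function. A minimal sketch: run_with_jvm is a hypothetical helper, work is any callable of yours that does all the reading, and the 8G default is just the value that worked earlier in this thread.

```python
def run_with_jvm(work, max_heap_size="8G"):
    """Start the JVM, call work(), and always kill the JVM afterwards.

    Call this at most once per process: the JVM cannot be restarted
    after kill_vm().
    """
    # Imported here so this module can be loaded without javabridge installed.
    import javabridge as jv
    import bioformats as bf

    jv.start_vm(class_path=bf.JARS, max_heap_size=max_heap_size)
    try:
        return work()
    finally:
        # Runs even if work() raises, so the program can still exit.
        jv.kill_vm()
```

Inside work() you can open and close as many readers as you like, either with the context manager or with an explicit close().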


jni commented May 25, 2014

@LeeKamentsky,

"JNI doesn't support that, not my choice."

Excuse me, I'm JNI and I strongly support that!

=P

Seriously, thank you very much for all your help. I should be golden from here, and I hope to have time for some useful PRs soon.
