Cactus fails when saving the HAL file with thousands of genomes #477
Comments
Correct me if I am wrong, which I well may be. I suspect, since the error message is truncated on the left (there is no matching '['), that the maximum command-line length may have been exceeded when calling whatever job exports the HAL file. If that were so, and knowing where the code is, maybe one could pass the list of genomes in a file instead of on the command line and fix the problem? Just thinking aloud.
Hmm, I think the logger may be to blame for the message being cut off. But I agree that the giant command line is probably behind this. Apparently it's governed by this limit, which seems pretty big (on my system at least).
But going through a file would certainly fix it, if that is the issue. What's weird is that a few commands that include the whole tree on the command line should already have run successfully to get to that point.
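(For reference, a minimal sketch, in plain C and assuming a Linux/POSIX system, of the two kernel limits a very long genome list or Newick tree argument could run into; this is not part of cactus, just a way to see the numbers on a given machine.)
```c
/* Quick check of the exec() command-line limits; nothing cactus-specific,
 * it just prints the two limits a very long argument could hit. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Total bytes allowed for argv + environ when execve() is called. */
    long arg_max = sysconf(_SC_ARG_MAX);

    /* On Linux a single argument is additionally capped at
     * MAX_ARG_STRLEN = 32 pages (usually 128 KiB). */
    long page_size = sysconf(_SC_PAGESIZE);

    printf("ARG_MAX (argv + environ total): %ld bytes\n", arg_max);
    printf("Per-argument cap (32 pages):    %ld bytes\n", 32 * page_size);
    return 0;
}
```
The first number is also available from the shell with getconf ARG_MAX.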
Have dug a bit further. The error seems to be in halAppendCactusSubtree, and has to do with the insertion of the subtree into the HDF5 HAL file, which hits a memory limit. When I run the command manually with
I get the following output (BTW, it seems --inMemory is now obsolete and should be changed in the cactus code).
Which seems to be the root of the problem. Now, if someone with better knowledge of the HAL libraries could have a look, maybe it would be possible to fix it by breaking down long trees and fasta files such as these into pieces and adding each piece one by one to the HAL file? Sorry for being such a nuisance, but I am really intent on trying to use Cactus and would like to do as much as possible before giving up in favor of other, possibly worse, tools.
As an alternative, would it be possible to run the pangenome pipeline iteratively, adding more genomes little by little?
Do you know if it's taking all the memory of your system when it fails, or do you think it's some limit internal to the library? If it's the former, running without the `--inMemory` option will fix it (at the cost of being a lot slower).
Also, if you're at all able to share the input to that `halAppendCactusSubtree` command, that would be very helpful.
What I've done is insert a half-hour sleep in common.py just before the call to process = subprocess.Popen(...). This allows me to kill -KILL the Python processes and preserve all the intermediate files for manual testing.
When I run halAppendCactusSubtree manually with '|& tee LOG', I get the following error, which seems to indicate that I've hit a limit on the number of links that can be accepted in the HDF5 "hal" file:
HDF5-DIAG: Error detected in HDF5 (1.10.1) thread 0:
  #000: H5D.c line 145 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #1: H5Dint.c line 490 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #2: H5L.c line 1695 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #3: H5L.c line 1939 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #4: H5Gtraverse.c line 867 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #5: H5Gtraverse.c line 639 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #6: H5L.c line 1742 in H5L_link_cb(): unable to create object
    major: Object header
    minor: Unable to initialize object
  #7: H5O.c line 3178 in H5O_obj_create(): unable to open object
    major: Object header
    minor: Can't open object
  #8: H5Doh.c line 291 in H5O__dset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #9: H5Dint.c line 1256 in H5D__create(): can't update the metadata cache
    major: Dataset
    minor: Unable to initialize object
  #10: H5Dint.c line 916 in H5D__update_oh_info(): unable to update datatype header message
    major: Dataset
    minor: Unable to initialize object
  #11: H5Omessage.c line 183 in H5O_msg_append_oh(): unable to create new message in header
    major: Attribute
    minor: Unable to insert object
  #12: H5Omessage.c line 223 in H5O_msg_append_real(): unable to create new message
    major: Object header
    minor: No space available for allocation
  #13: H5Omessage.c line 1933 in H5O_msg_alloc(): unable to allocate space for message
    major: Object header
    minor: Unable to initialize object
  #14: H5Oalloc.c line 1314 in H5O_alloc(): object header message is too large
    major: Object header
    minor: Unable to initialize object
terminate called after throwing an instance of 'H5::GroupIException'
So it looks like it is a problem with the HDF5 libraries. On the other hand, a lookup of the error message on the Net brought up this comment in a thread on StackOverflow:
"HDF5 has a header limit of 64kb for all metadata of the columns. This include name, types, etc. When you go about roughly 2000 columns, you will run out of space to store all the metadata. This is a fundamental limitation of pytables. I don't think they will make workarounds on their side any time soon. You will either have to split the table up or choose another storage format."
Which seems to confirm this. BTW, it seems the limit may actually be ~1000 items. OTOH, it seems the limit may have been removed in recent (>1.8) versions of the HDF5 library (I'm using 1.10). Anyway, it all looks like an HDF5 library issue. Nevertheless, the info from the HDF5 Group at
https://support.hdfgroup.org/HDF5/faq/limits.html
seems to suggest this shouldn't happen. Thus, maybe the problem is in the HAL libraries? I'll have a look at this next.
I had a look at the cactus2hal source code, and there are no calls to HDF5 there, only to HAL library routines.
In the GitHub forum for h5py there is a comment pointing to a possible solution, from the HDF5 manual:
"Large Attributes Stored in Dense Attribute Storage
We generally consider the maximum size of an attribute to be 64K bytes.
The library has two ways of storing attributes larger than 64K bytes:
in dense attribute storage or in a separate dataset. Using dense
attribute storage is described in this section, and storing in a
separate dataset is described in the next section. To use dense
attribute storage to store large attributes, set the number of
attributes that will be stored in compact storage to 0 with the
H5Pset_attr_phase_change function. This will force all attributes to be
put into dense attribute storage and will avoid the 64KB size
limitation for a single attribute in compact attribute storage."
So, probably one could fix this in the HAL library. The problem is that I do not yet know exactly where to look in HAL to fix it. Apparently, by using H5Pset_attr_phase_change to set the number of compactly stored attributes to 0, one could lift the limit and allow the run to go through. But now I have to do some (unrelated) statistical analyses, write some recommendation letters and arrange a few meetings, so I likely won't be able to pursue this further today.
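For what it's worth, here is a minimal sketch of that idea expressed with the plain HDF5 C API (not HAL's C++ wrappers, so this is not the actual fix in HAL; the file, group and attribute names and the ~100 KB payload are invented for illustration):
```c
/* Sketch of the workaround from the manual excerpt above: an attribute
 * bigger than the ~64 KB object-header limit only fits once it is pushed
 * into dense attribute storage via H5Pset_attr_phase_change(plist, 0, 0). */
#include <stdlib.h>
#include <string.h>
#include "hdf5.h"

int main(void) {
    hid_t file = H5Fcreate("big_attr_test.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    /* Group-creation property list: store ALL attributes of this group
     * densely (max_compact = 0, min_dense = 0) instead of inside the
     * object header, where the 64 KB message limit applies. */
    hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
    H5Pset_attr_phase_change(gcpl, 0, 0);

    hid_t grp = H5Gcreate2(file, "/genomes", H5P_DEFAULT, gcpl, H5P_DEFAULT);

    /* A ~100 KB string standing in for something like a huge Newick tree. */
    size_t n = 100 * 1024;
    char *tree = malloc(n);
    memset(tree, 'A', n - 1);
    tree[n - 1] = '\0';

    hid_t space = H5Screate(H5S_SCALAR);
    hid_t type  = H5Tcopy(H5T_C_S1);
    H5Tset_size(type, n);

    /* With a default gcpl, this is the call that should blow up with
     * "object header message is too large". */
    hid_t attr = H5Acreate2(grp, "newick_tree", type, space,
                            H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, type, tree);

    H5Aclose(attr);
    H5Tclose(type);
    H5Sclose(space);
    H5Gclose(grp);
    H5Pclose(gcpl);
    H5Fclose(file);
    free(tree);
    return 0;
}
```
Compiled with h5cc, the same H5Acreate2 call with a default group-creation property list should reproduce the "object header message is too large" error from the trace above, so this would be the kind of property-list change to look for in HAL.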
Should I file an issue in the HAL library GitHub, or can you talk to them directly?
Thanks for looking into this! I will try to reproduce locally -- it seems that any trivial alignment with enough genomes should do it. If I can reproduce it, then I'll test out
For reference, you can use
I will note that this reproduces very easily. Using the following, it seems the maximum number of genomes I can reach is 545
This is the last issue for now: we have been able to run cactus with ~5000 bacterial genomes down to the HAL-writing step. It fails with a message saying it received SIGABRT. Since it fails, I do not know for sure whether the alignment has been completed by this step or is still to be done. The log file says
Two things draw my attention: first, the first sequence is named 'enomeXXXX' and not 'genomeXXXX', which might indicate a memory leak or a corruption error.
The second (which may be explained by the former) is that the job received a SIGABRT signal, indicating it must have tried to do something wrong from the system's point of view (maybe a stack smash or something similar; I don't know, as there is no backtrace).
We are running cactus on a machine with 1 TB RAM, 100 CPUs and one NVIDIA GPU, using a temporary directory on a 2 TB SSD disk and storing the results on a network-mounted disk with 32 TB (~16 TB free). So it is doubtful that the problem lies in exceeding memory, CPU or disk limits. Plus, when re-run with --restart, it fails very quickly.
It fails whether we assign up to 800 GB of RAM, up to 500 GB of disk space or up to 50 cores. We have also run it with default arguments. In any case, it always fails at the same step with the same error.
Truth be told, at this point I am somewhat at a loss. I am thinking of trying to trace the program step by step, but fear it could take almost forever to reach the breaking point, and I have many other things to do. If you could shed any light and provide any help, it would be most welcome.
Sincerely
JRValverde, CNB-CSIC