
pick : overly long and high memory usage #207

Closed
fmarletaz opened this issue Aug 8, 2019 · 7 comments

@fmarletaz

Hi - thanks for developing such a great tool mikado. I have been running it on several genomes without problem. However, for a recent project, I have some issues at the final mikado pick stage which takes a very long time (several days) and seems to require >100Gb of memory when using 8 cores. The problem might be that the RNA-seq data are relatively noisy (a lot of low coverage non-coding transcripts) that I tried to filter out. Is there any parameter that I could tweak to fix that? Any suggestions. I am happy to share the log if necessary. Thanks again!

@lucventurini lucventurini self-assigned this Aug 8, 2019
@lucventurini lucventurini added this to the 2.0 milestone Aug 8, 2019
@lucventurini (Collaborator) commented Aug 8, 2019

Hi @fmarletaz, thank you for reporting the issue. We recently had a similar problem caused by excessive locus chaining. What happened was that a relatively small number of transcripts had enormous introns and/or exons (e.g. over a megabase pair long!). This caused Mikado to gather hundreds of thousands of transcripts into a single chunk, which drove memory usage and runtime to insane levels (Mikado was written to deal with hundreds or, at most, a few thousand transcripts per chunk, and in particular the graph algorithms we use scale poorly).

To prevent small numbers of rogue transcripts from crashing the program, we recently implemented a new parameter, "max_intron", for mikado prepare (see issue #186 and commit a131b94). This change is already implemented in 2.0rc3.

There are multiple ways of checking whether this is the case. One option is to run `gffread -i 100000 -T -o mikado_prepared.GFFREAD.gtf mikado_prepared.gtf` and see how many transcripts get discarded (`-i 100000` tells gffread to discard any transcript with an intron longer than 100 kbp; `-T` requests GTF output). You could then give the filtered GTF to mikado pick.
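For intuition, here is a rough, hypothetical Python sketch of the kind of intron-length check that `gffread -i` performs. This is not gffread's or Mikado's actual code; it assumes a minimal, well-formed GTF whose exon lines carry a `transcript_id` attribute.

```python
# Illustrative sketch only: flag transcripts whose longest intron exceeds
# a threshold, as gffread -i does. Assumes a minimal, well-formed GTF.
import re
from collections import defaultdict

def transcripts_with_long_introns(gtf_lines, max_intron=100_000):
    """Return the set of transcript IDs whose longest intron exceeds max_intron."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            # GTF columns 4 and 5 are the 1-based start and end coordinates.
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))
    flagged = set()
    for tid, coords in exons.items():
        coords.sort()
        # Intron length = gap between the end of one exon and the start of the next.
        for (_, prev_end), (next_start, _) in zip(coords, coords[1:]):
            if next_start - prev_end - 1 > max_intron:
                flagged.add(tid)
                break
    return flagged
```

Counting the flagged transcripts gives a quick estimate of how many rogue models would be removed at a given threshold before committing to a full re-run.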

If this does not solve the problem, please let us know as soon as possible, so that we can investigate further!

@fmarletaz (Author) commented Aug 9, 2019 via email

@lucventurini (Collaborator)

> I ran the command you suggested, which removed 13,939 transcripts from mikado_prepared.gtf (original number: 943,110). Do you think that would be enough long transcripts to cause a problem?

Yes, it most definitely would. Mikado's partitioning algorithm is based solely on transcripts' coordinates, so even a handful of rogue transcripts like those will cause Mikado to gobble up too many transcripts into a single locus, with the consequences you saw.
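As a toy illustration (not Mikado's actual algorithm), a purely coordinate-based chaining step might look like the sketch below. Note how a single interval spanning the whole region fuses every otherwise-separate locus into one chunk:

```python
# Toy illustration of coordinate-based chaining: transcripts whose
# intervals overlap end up in the same chunk, so one transcript with a
# huge intron can fuse many unrelated loci into a single giant chunk.
def chain_into_chunks(intervals):
    """Group (start, end) intervals into runs of mutual overlap."""
    chunks, current, extent = [], [], None
    for start, end in sorted(intervals):
        if current and start <= extent:
            # Overlaps the current chunk's rightmost extent: join it.
            current.append((start, end))
            extent = max(extent, end)
        else:
            if current:
                chunks.append(current)
            current, extent = [(start, end)], end
    if current:
        chunks.append(current)
    return chunks
```

With ten well-separated loci this yields ten small chunks; adding one interval that spans the whole region collapses them into a single chunk of eleven transcripts, which is the blow-up described above.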

> I will try the options you suggested: (1) try mikado 2.0rc3 with max_intron and (2) try mikado pick with the filtered GTF.

Great! Please let us know how it goes.

> By the way, I remember we talked previously about a script to use Mikado transcripts to add UTRs to Augustus-generated gene models. Do you plan to release it soon?

So, this is a bit awkward ... Mikado should now have all the necessary infrastructure to perform this operation, but we have not yet written a straightforward way to do it. In fact, it is not completely possible at the moment, due to some restrictions in the configuration.

We have started to work on this. For example, mikado pick now has an --only-reference-update flag, which discards any locus that contains only non-reference transcripts (i.e., in your case, non-Augustus transcripts). However, this work is not finished yet.

@lucventurini (Collaborator)

Hi @fmarletaz, I had quite a sprint over the weekend. In the latest commit (26adc8b) I changed the way Mikado calculates the initial transcript graph. This should make Mikado much more robust to huge loci, as I think the bottleneck was the creation of that graph. Hopefully this will also help prevent problems like the one you encountered.

@fmarletaz (Author)

Thanks a lot! I am going to try it!

@lucventurini (Collaborator)

OK! I am still experimenting with this as well - I have seen some improvement, but it is still a work in progress. Specifically, Mikado will now prune a locus relatively aggressively if it contains too many transcripts. Please follow branch 207 for updates to the code!

lucventurini added a commit to lucventurini/mikado that referenced this issue Aug 14, 2019: "…reation and checking to the subprocesses. This should speed it up a lot in the presence of complex, long loci."
@lucventurini (Collaborator) commented Aug 15, 2019

Dear @fmarletaz, the changes have now been merged into master.
Briefly, Mikado pick will now use multiple processes much more efficiently, and it will detect during the run whether any transcript has an intron that is too long (the default threshold is 1 Mbp; according to ENCODE you might lose some genuine examples, but I would argue that this is sensible for the vast, vast majority of cases).

With these changes, Mikado was able to analyse over 1 million transcripts in about one hour, with a peak memory usage of 11 GB, using 32 parallel processes. I think this is reasonable performance, given that Mikado is written in Python rather than a more performant language (e.g. C++) and that there are disk bottlenecks (querying the database).

As such, I will close this issue for now.

lucventurini added this to Closed in Version 2, Oct 15, 2020