
pick : overly long and high memory usage #207

Closed
fmarletaz opened this issue Aug 8, 2019 · 7 comments

@fmarletaz

Hi - thanks for developing such a great tool mikado. I have been running it on several genomes without problem. However, for a recent project, I have some issues at the final mikado pick stage which takes a very long time (several days) and seems to require >100Gb of memory when using 8 cores. The problem might be that the RNA-seq data are relatively noisy (a lot of low coverage non-coding transcripts) that I tried to filter out. Is there any parameter that I could tweak to fix that? Any suggestions. I am happy to share the log if necessary. Thanks again!

@lucventurini lucventurini self-assigned this Aug 8, 2019
@lucventurini lucventurini added this to the 2.0 milestone Aug 8, 2019
@lucventurini (Collaborator) commented Aug 8, 2019

Hi @fmarletaz, thank you for reporting the issue. We recently had a similar problem caused by excessive locus chaining. What happened was that a relatively small number of transcripts had enormous introns and/or exons (e.g. over a megabase pair long!). This caused Mikado to gather hundreds of thousands of transcripts into a single chunk, which drove memory usage and runtime to insane levels (Mikado was written to deal with hundreds or, at most, a few thousand transcripts per chunk, and in particular the graph algorithms we use scale poorly).

To prevent small numbers of rogue transcripts from crashing the program, we recently implemented a new parameter, "max_intron", for mikado prepare (see issue #186 and commit a131b94). This change is already implemented in 2.0rc3.

There are multiple ways of checking whether this is the case. One option is to run `gffread -i 100000 -T -o mikado_prepared.GFFREAD.gtf mikado_prepared.gtf` and see how many transcripts get discarded (`-i 100000` tells gffread to discard any transcript with an intron longer than 100 kbp; `-T` requests GTF output). You could then give the filtered GTF to mikado pick.
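For intuition, here is a rough, hypothetical Python sketch of the kind of intron-length check that `gffread -i` performs. This is not gffread's or Mikado's actual code; it assumes a minimal, well-formed GTF whose exon lines carry a `transcript_id` attribute.

```python
# Illustrative sketch only: flag transcripts whose longest intron exceeds
# a threshold, as gffread -i does. Assumes a minimal, well-formed GTF.
import re
from collections import defaultdict

def transcripts_with_long_introns(gtf_lines, max_intron=100_000):
    """Return the set of transcript IDs whose longest intron exceeds max_intron."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            # GTF columns 4 and 5 are the 1-based start and end coordinates.
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))
    flagged = set()
    for tid, coords in exons.items():
        coords.sort()
        # Intron length = gap between the end of one exon and the start of the next.
        for (_, prev_end), (next_start, _) in zip(coords, coords[1:]):
            if next_start - prev_end - 1 > max_intron:
                flagged.add(tid)
                break
    return flagged
```

Counting the flagged transcripts gives a quick estimate of how many rogue models would be removed at a given threshold before committing to a full re-run.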

If this does not solve the problem, please let us know as soon as possible, so that we can investigate further!

@fmarletaz (Author) commented Aug 9, 2019 via email

@lucventurini (Collaborator)

> I ran the command you suggested, which removed 13,939 transcripts from mikado_prepared.gtf (original number: 943,110). Do you think that would be enough long transcripts to cause a problem?

Yes, it most definitely would. Mikado's partitioning algorithm is based solely on transcripts' coordinates, so even a handful of rogue transcripts like those will cause Mikado to gobble up too many transcripts into a single locus, with the consequences you saw.
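As a toy illustration (not Mikado's actual algorithm), a purely coordinate-based chaining step might look like the sketch below. Note how a single interval spanning the whole region fuses every otherwise-separate locus into one chunk:

```python
# Toy illustration of coordinate-based chaining: transcripts whose
# intervals overlap end up in the same chunk, so one transcript with a
# huge intron can fuse many unrelated loci into a single giant chunk.
def chain_into_chunks(intervals):
    """Group (start, end) intervals into runs of mutual overlap."""
    chunks, current, extent = [], [], None
    for start, end in sorted(intervals):
        if current and start <= extent:
            # Overlaps the current chunk's rightmost extent: join it.
            current.append((start, end))
            extent = max(extent, end)
        else:
            if current:
                chunks.append(current)
            current, extent = [(start, end)], end
    if current:
        chunks.append(current)
    return chunks
```

With ten well-separated loci this yields ten small chunks; adding one interval that spans the whole region collapses them into a single chunk of eleven transcripts, which is the blow-up described above.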

> I will try the options you suggested: (1) try mikado 2.0rc3 with max_intron and (2) try mikado pick with the filtered GTF.

Great! Please let us know how it goes.

> By the way, I remember we talked previously about a script to use Mikado transcripts to add UTRs to Augustus-generated gene models. Do you plan to release it soon?

So, this is a bit awkward ... Mikado should now have all the necessary infrastructure to perform this operation, but we have not yet written a straightforward way to do it. In fact, it is not completely possible at the moment, due to some restrictions in the configuration.

We have started to work on this. For example, mikado pick now has an --only-reference-update flag, which discards any locus that contains only non-reference transcripts (i.e., in your case, non-Augustus transcripts). However, this work is not finished yet.

@lucventurini (Collaborator)

Hi @fmarletaz, I had quite a sprint over the weekend. In the latest commit (26adc8b) I changed the way Mikado calculates the initial transcript graph. This should make Mikado much more robust to huge loci, as I think the bottleneck was the creation of that graph. Hopefully this will also help prevent problems like the one you encountered.

@fmarletaz (Author)

Thanks a lot! I am going to try it!

@lucventurini (Collaborator)

OK! I am still experimenting with this as well - I have seen some improvement, but it is still a work in progress. Specifically, Mikado will now prune a locus relatively aggressively if it contains too many transcripts. Please follow branch 207 for updates to the code!

lucventurini added a commit to lucventurini/mikado that referenced this issue Aug 14, 2019: "…reation and checking to the subprocesses. This should speed it up a lot in the presence of complex, long loci."
@lucventurini (Collaborator) commented Aug 15, 2019

Dear @fmarletaz, the changes have now been merged into master.
Briefly, Mikado pick will now use multiple processes much more efficiently, and it will detect during the run whether any transcript has an intron that is too long (the default threshold is 1 Mbp; according to ENCODE you might lose some genuine examples, but I would argue that this is sensible for the vast, vast majority of cases).

With these changes, Mikado was able to analyse over 1 million transcripts in about one hour, with a peak memory usage of 11 GB, using 32 parallel processes. I think this is reasonable performance, given that Mikado is written in Python rather than a more performant language (e.g. C++) and that there are disk bottlenecks (querying the database).

As such, I will close this issue for now.

lucventurini added this to Closed in Version 2, Oct 15, 2020