Mikado Serialise crashing in BLAST parsing #229
Dear @ChemaMD, many thanks for reporting this bug. I will have a look at it now. Kind regards
Thank you very much for such a quick response and solution! I will update and re-run.
Dear @lucventurini, the update solved the issue! Thank you very much for your help. I'm now running Mikado pick, and it seems to run for a long time without actually writing anything to any of the output files. The log gives some warnings, and it appears as if the pipeline started checking the database but then went silent. I also tried running the same command with an additional flag. Thank you for your help. ====== Pick log =====
Dear @ChemaMD, Best
I just launched the following command:
I added the flag. Below is the log from this run (it is still running but not writing anything down). I consistently get a WARNING message about transcript padding. ======= PICK LOG ======
Dear @ChemaMD,
My fault, the message should really be INFO level. It is literally logging which transcripts had been padded, nothing more.
In the output folder there should be a temporary folder; could you check whether it is present and whether Mikado is writing files inside it?
Dear @lucventurini, the command did not generate the temporary folder, and it is still running (it's been running for 5h) without writing anything to the output files. Below is the log of the commands I ran:
Dear @ChemaMD, it is really strange. Mikado pick should be writing to the files inside the temporary directory almost immediately.
Could you try running `mikado pick` with a single process on a strongly reduced GTF (for example, just the first few thousand lines)? This will force Mikado into a single process and should finish really quickly. The log will have a lot of information we can use to understand what is going on, especially if Mikado still does not finish even with a reduced GTF. (A rough sketch of such a command is given below.)
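A minimal sketch of this kind of debugging run, assuming the flag names I remember from Mikado's command line (`--json-conf`, `--procs`, `--log`, `--log-level`); the exact option names and all file names below are placeholders and may differ in your installation, so check `mikado pick --help` first:

```bash
# Take only the first 10,000 lines of the prepared GTF (placeholder file names).
head -n 10000 mikado_prepared.gtf > mikado_prepared.10k.gtf

# Run pick as a single process with a verbose log.
# Flag names are assumptions from memory of Mikado's CLI, not taken from this thread.
mikado pick --json-conf configuration.yaml \
            --procs 1 \
            --log pick_debug.log --log-level DEBUG \
            mikado_prepared.10k.gtf
```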
Hi @lucventurini, it now worked well with just a 10,000-line GTF. I attached the log.
Fix for #229 - now `mikado pick` will:
- report the padding as INFO, not as WARNING
- report on finishing the **analysis** of a chromosome, not the parsing
- report the temporary analysis directory
Dear @ChemaMD, OK this is progress! May I ask you then to update Mikado to the latest commit and re-run the reduced GTF, this time with multiple processes?
If it functions, then we should try with a whole scaffold/chromosome (probably the smallest would be the safest bet), again starting with the reduced settings. Also, I did modify the code of Mikado (see 36567b0 for details, but please download 9da88d4) so that the log now reports when it has actually finished analysing a locus while in multiprocessing mode. In single-processing mode the messages were already informative (i.e. when it says it has finished a chromosome, it has finished analysing it). When launching with multiple processes, inside the temporary folder there should be files that Mikado writes as it goes. I am honestly still unclear on what is going on, but hopefully through this debugging we can solve the issue. (A sketch of how to extract a single scaffold is given below.)
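A minimal sketch of how one might pull a single scaffold out of the prepared GTF for such a test; the scaffold name and file names are placeholders, not taken from this thread:

```bash
# Keep only records belonging to one scaffold (placeholder name "scaffold_1"),
# then point mikado pick at the single-scaffold GTF as in the earlier sketch.
awk -F'\t' '$1 == "scaffold_1"' mikado_prepared.gtf > scaffold_1.gtf
```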
Hi @lucventurini, I just tried with multiple processors and the small fixed GTF, and it worked well. It created the temporary file. It also worked well with just the smallest chromosome, and also when adding the flag.
Excellent. Hopefully the new logging will give you a sense of how it is proceeding with the full dataset. Thank you for nudging me into making those messages more informative!
Yes, well spotted - I changed it recently. Documentation is being updated on a dedicated branch; once this issue and #226 are fixed, it is one of the few remaining roadblocks for finally having an official Mikado 2.
Dear @lucventurini, I launched the command on the full dataset; the log is below. ==== LOG ====
Dear @ChemaMD, no, it is not normal. Mikado should have dumped the transcripts to the temporary directory by now. Kind regards
Dear @lucventurini, I started two approaches to overcome this issue and to test whether the problem is a complex locus that stalls the run. ===== LOG ====
Dear @ChemaMD, thank you, this looks like a proper bug. I will have a look at the relevant code now.

Regarding finding the locus that is creating the problem, I have thought of a way:
1. Convert the prepared GTF into a BED12, putting "1" as a score (please change the file names as appropriate).
2. Merge the resulting intervals, counting the transcripts that fall into each merged locus.

This should already give us an indication of which loci are massive: it will produce a BED file where the fourth column is the number of transcripts per locus. This should lead us fairly quickly to the culprit. (A rough sketch of this recipe is given below.)

Another possible solution, and this could definitely make a big impact, is to change the parameter controlling the maximum intron length. I hope this helps.
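A rough equivalent of that recipe using plain transcript spans rather than a full BED12; file names are placeholders, and it assumes the prepared GTF contains "transcript" feature lines (otherwise merge the exon lines instead):

```bash
# 1. Build a simple BED of transcript spans from the prepared GTF, with "1" as the score column.
awk -F'\t' '$3 == "transcript" {print $1"\t"($4-1)"\t"$5"\t1"}' mikado_prepared.gtf \
    | sort -k1,1 -k2,2n > transcripts.bed

# 2. Merge overlapping transcripts; with -o count the fourth output column is the
#    number of transcripts falling into each merged locus.
bedtools merge -i transcripts.bed -c 4 -o count > loci_counts.bed

# 3. Show the most crowded loci first.
sort -k4,4nr loci_counts.bed | head
```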
PS: @swarbred, would you think that reducing the default value for the maximum intron length (1,000,000) would be a good idea? It seems that keeping it this high is creating problems. I know that there are known introns in the H. sapiens annotation that are as long as 4 megabasepairs, but even there, 99% of the known introns are shorter than 100,000 bp (more precisely, shorter than 87,000 basepairs).
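For readers who want to reproduce this kind of statistic on their own annotation, a rough recipe (not the one used for the figures above); it assumes the exons of each transcript appear on consecutive lines of the GTF, and all file names are placeholders:

```bash
# Emit the length of the gap between consecutive exons of the same transcript.
awk -F'\t' '$3 == "exon" {
    match($9, /transcript_id "[^"]+"/); tid = substr($9, RSTART, RLENGTH);
    if (tid == prev) {
        gap = ($4 > pend) ? $4 - pend - 1 : pstart - $5 - 1;
        if (gap > 0) print gap;
    }
    prev = tid; pstart = $4; pend = $5;
}' annotation.gtf | sort -n > intron_lengths.txt

# Count how many introns exceed 100,000 bp and what fraction of the total that is.
awk '$1 > 100000 {long++} END {print long+0, NR, (long+0)/NR}' intron_lengths.txt
```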
#229: added "--max-intron-length" as a command line option for pick as well.
I would be wary of changing this to 100,000. While there are relatively few transcripts with an intron larger than this (~7,000 in the human GENCODE v29 annotation), excluding them in the prepare step would remove any chance of annotating them. The HISAT2 default is 500,000, and only 105 transcripts are above this, so that is probably a better choice if you were to make a change. We have run lots of rodent genomes through with no issues (with the default 1,000,000, but based on HISAT2 alignments with a maximum intron size of 500,000).
Good point. Thank you for the comment. I think that in general 500,000 might be a saner option, thinking back to some pathological cases. But I will not pull the trigger yet. |
@lucventurini ok, I started a new mikado run with the lower intron length (50,000; if mikado now runs completely, I might then consider going up to 500,000 or just 100,000). In my case, this is an annelid genome, and there are not many other genomes to use as reference (the best one would be Capitella, and it is half the size of the genome I'm annotating). From the ~4.8M transcripts I loaded, roughly 80,000 were removed by the new intron limit. The test running the initial dataset scaffold by scaffold is still ongoing; it has only managed to finish the three smallest scaffolds so far. In some of them, there was an error, I think associated with an invalid codon, which didn't kill the run but implied removing the super locus (log below).
Indeed they might just be Trinity misassemblies, or GMAP misalignments. This would fit the pattern we observed in other datasets.
Yes indeed. This is an oversight on my part; I will fix it immediately. Thank you again for reporting! EDIT: fixed in a278687
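For anyone hitting the same error on an older commit, the class of problem is simply a case-sensitive codon comparison; a trivial illustration (not Mikado's actual code):

```bash
# A lowercase stop codon fails a case-sensitive comparison...
codon="taa"
[ "$codon" = "TAA" ] && echo "stop codon" || echo "not recognised"       # -> not recognised
# ...but is recognised once the sequence is upper-cased first.
[ "$(printf '%s' "$codon" | tr a-z A-Z)" = "TAA" ] && echo "stop codon"  # -> stop codon
```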
@ChemaMD Just to comment: for an annelid genome a 50,000 max intron size should be fine; the ~80,000 transcripts removed will overwhelmingly be alignment artefacts.
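For reference, a hedged sketch of how such a cap can be applied. The "--max-intron-length" option for `mikado pick` is the one added in the commit referenced in this issue; its availability in `mikado prepare`, and the other flags and file names, are assumptions, so check `--help` for your version:

```bash
# Drop transcripts with overly long introns at the prepare stage
# (flag and file names are assumptions, not taken from this thread).
mikado prepare --json-conf configuration.yaml --max-intron-length 50000

# The same cap can now also be passed to pick, per the commit referenced in this issue.
mikado pick --json-conf configuration.yaml --max-intron-length 50000 mikado_prepared.gtf
```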
Thanks @swarbred and @lucventurini for the feedback and help. The single scaffold runs are progressively finishing, and based on the size of these scaffolds, it appears that the runtime scales with the size of the scaffold.
@ChemaMD, on our A. thaliana test dataset, mikado did approximately 45 Mbp/hour per CPU. However, that dataset had about 20 times fewer transcripts, at about 200,000. Your observation is therefore in line with what I would expect - mikado should scale almost linearly with transcript numbers and transcript density, and that seems to be the case here. Throwing more CPUs at the problem should help, as the task is completely parallelised.
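A back-of-the-envelope check of that scaling, using the rough figures from this thread (45 Mbp/hour/CPU on A. thaliana, ~20x more transcripts here) and an illustrative ~44 Mbp scaffold; all numbers are approximations, not measurements:

```bash
# Expected throughput here ≈ 45 / 20 ≈ 2.25 Mbp/hour/CPU,
# so a ~44 Mbp scaffold ≈ 44 / 2.25 ≈ 19.5 CPU-hours.
echo "scale=2; 44 / (45 / 20)" | bc
```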
Thanks @lucventurini! I will continue with the run with the limited intron size and submit it with more CPUs.
Not a problem! Out of curiosity, how big is the genome you are analysing? I had a quick look, and the best I could find seemed to indicate a genome size between 250 Mbp and 1.2 Gbp for Annelida. No particular reason, just curious about the organism (and, admittedly, about whether my back-of-the-envelope calculations on mikado performance are really in the right ballpark).
The genome is ~550 Mb, mid-range for an annelid. We will have some others closer to the 1 Gb upper range. I did the BED12 calculation for Scaffold01 (~44 Mb), and there seems to be a super-locus that spans almost the entire chromosome, with of course hundreds of thousands of transcripts:
Yep, that would nicely break Mikado's parallelisation - all those hundreds of thousands of transcripts would have been analysed by a single process. |
Dear @lucventurini, the run has now finished successfully.
Excellent news! Thank you for letting us know. I will close the issue, and we can discuss further on Wednesday. |
* Fix for #229 - now `mikado pick` will:
  - report the padding as INFO, not as WARNING
  - report on finishing the **analysis** of a chromosome, not the parsing
  - report the temporary analysis directory
* #229: added "--max-intron-length" as a command line option for pick as well.
* #229: solve the issue with (partially?) lowercase codons
Hi,
Thank you very much for developing such a useful piece of software. I'm running the latest version of Mikado (2) and I'm encountering a parsing problem with an XML DIAMOND BLAST file. I'm running serialise on a cluster using a single core (I have a single XML file, and from the help info of mikado serialise, it seems I cannot use more CPUs than XML files) and 15 GB of RAM. It apparently crashes when parsing queries that hit the same target, "Q9NBX4" (there are 8,001 hits in the XML; it appears to be a transposon). I have attached the logs. Any idea how to overcome this issue?
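A rough way to see how often a single target dominates the report (file name is a placeholder; `<Hit_accession>` is part of the standard BLAST XML schema, which DIAMOND's XML output is assumed to follow, so verify on your file):

```bash
# Rank the most frequently hit targets in the XML report;
# a transposon-like protein such as Q9NBX4 would stand out with thousands of hits.
grep -o '<Hit_accession>[^<]*</Hit_accession>' diamond_results.xml \
    | sort | uniq -c | sort -rn | head
```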
Thank you very much for your help.
Chema
====== SERIALISE.LOG ======
====== STDOUT ======