
Releases: EI-CoreBioinformatics/mikado

Total multiprocessing

02 Mar 22:37

This release brings Mikado prepare up to speed with the other parts of the suite; it is now completely multiprocessed.

WARNING: starting from this release, the Mikado library is called "Mikado", not "mikado_lib".

Correct boundaries

01 Mar 18:25

BF release:

  • Fixed a very nasty bug that led to the creation of bogus CDSs and split transcripts, especially on the negative strand. This release also contains some tests to prevent regressions.
  • Additionally, after splitting transcripts Mikado now checks that the internal ORFs are coherent with the original transcript.
  • Serialise now relies on a Process rather than a Pool implementation.
  • Reworked prepare to avoid keeping all the GFF lines in memory: common information is now kept aside and only the intervals are stored explicitly, which should massively decrease memory usage (a sketch of the idea follows this list).
    WARNING: as a result, Mikado now requires input files to contain valid exon entries. A file containing only CDS/UTR entries will be completely ignored.
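As an illustration of the interval-based storage, here is a minimal sketch; the data model (a per-transcript record plus bare exon intervals) is hypothetical and does not reflect Mikado's actual internal structures:

```python
# Hypothetical sketch: keep one compact record plus bare exon
# intervals per transcript, instead of retaining every parsed line.
from collections import namedtuple

TranscriptRecord = namedtuple("TranscriptRecord",
                              ["tid", "chrom", "strand", "exons"])

def compact(parsed_lines):
    """Collapse parsed GFF exon lines (dicts here) into compact records."""
    transcripts = {}
    for line in parsed_lines:
        if line["feature"] != "exon":   # only exon features are kept
            continue
        tid = line["transcript_id"]
        if tid not in transcripts:
            # Common information is stored once per transcript...
            transcripts[tid] = TranscriptRecord(
                tid, line["chrom"], line["strand"], [])
        # ...while each line contributes only its (start, end) interval
        transcripts[tid].exons.append((line["start"], line["end"]))
    return transcripts
```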

True multiprocessing

26 Feb 22:18

New in this release, which marks a real milestone:

  • Added Travis CI testing
  • Possibility to choose the preferred multiprocessing start method from the configuration file
  • Added the key "only_confirmed_introns"
  • Moved picker and the new loci_processer module to a new subpackage, "picking"
  • AS events now have to be valid against all transcripts in the locus, not just the primary one
  • Faster retrieval of verified introns
  • Bug fix for the printing of BED12 objects
  • In multiprocessing mode, each process now writes to a temporary file; at the end of the run, the files are merged (see the sketch after this list).
  • Switched to simple Queues instead of Manager-derived ones, which is dramatically faster.
  • Switched to pyfaidx for ORF loading in the database
  • ORFs are now loaded before the BLAST hits
  • Mikado prepare now also keeps the information of the original transcript, when present.
  • BLAST files should be opened with the new BlastOpener class; "create_opener" is gone. The class can be used in "with" statements, which prevents the process from having too many files open at once.
  • We now output monoloci_scores/metrics and loci_scores/metrics files
  • In the loci scores/metrics files, transcripts with more than one ORF are reported multiple times, once per ORF, to allow for better filtering using the provided tables.
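As a rough illustration of the temporary-file pattern mentioned above, here is a minimal, self-contained sketch; the function names and file handling are hypothetical and do not reflect Mikado's actual code:

```python
# Minimal sketch of the per-process temporary file pattern: each
# worker writes its results to its own file, and the parent merges
# the files once every worker has finished.
import multiprocessing as mp
import shutil
import tempfile

def worker(chunk, out_name):
    # Hypothetical worker: "process" a chunk and write to its own file
    with open(out_name, "w") as out:
        for item in chunk:
            out.write(f"{item}\n")  # stand-in for real processing

def run(chunks, final_name):
    tmp_names = [tempfile.NamedTemporaryFile(delete=False, suffix=".txt").name
                 for _ in chunks]
    procs = [mp.Process(target=worker, args=(chunk, name))
             for chunk, name in zip(chunks, tmp_names)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    # Merge the per-process files into the final output
    with open(final_name, "wb") as final:
        for name in tmp_names:
            with open(name, "rb") as tmp:
                shutil.copyfileobj(tmp, final)

if __name__ == "__main__":
    run([range(0, 5), range(5, 10)], "merged.txt")
```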

After all these modifications, Mikado pick now completes AT in ~13 minutes; Chr1 finishes in under 3. The 3B scaffolds in the TGAC assembly finished in less than one hour.

Light connections

23 Feb 02:21

Major changes for this version:

  • Switched away from clique-based algorithms for finding the communities. By using NetworkX's "connected_components" function, even the toughest loci can be analysed in a few seconds (see the sketch after this list);
  • Modified the SQL query retrieval for BLAST data, switching away from the ORM in favour of direct SQL queries. The result is a massive speedup, allowing for real multiprocessing;
  • Bug fix for awk_gtf;
  • Mikado pick will no longer crash on an invalid transcript; it will instead emit an error in the log and ignore the offending record;
  • The reduction heuristics introduced in the previous version proved unnecessary with the new community-finding algorithm; they are effectively disabled by raising the thresholds to 1000 nodes, with the most connected node at 1000 edges.
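For reference, a minimal sketch of community finding via NetworkX's connected_components; the transcript-overlap graph shown here is a toy stand-in for Mikado's actual locus graph:

```python
# Toy sketch: find communities as connected components of a graph
# whose nodes are transcripts and whose edges mark compatibility.
import networkx as nx

graph = nx.Graph()
graph.add_edges_from([
    ("tA", "tB"), ("tB", "tC"),  # one community: tA, tB, tC
    ("tD", "tE"),                # another community: tD, tE
])
graph.add_node("tF")             # a singleton community

# connected_components yields one set of nodes per community,
# without any expensive clique enumeration.
for community in nx.connected_components(graph):
    print(sorted(community))
```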

NP reduction

20 Feb 00:23

When faced with a complex locus (more than 250 nodes, or a most-connected node with more than 200 edges), Mikado will now employ the following algorithm:

  • First approximation: remove all redundant intron chains (i.e. those completely contained within another compatible intron chain)
  • Second approximation: remove all transcripts completely contained within another (class code "c")
  • Third approximation: use the "source" field in the original files to collect transcripts from different sources until the limit is reached.

This ensures that even the most complex loci can be solved relatively quickly and painlessly.
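A minimal sketch of the first two reduction steps, under a deliberately simplified data model (each transcript is a (start, end) span plus a tuple of intron coordinates; this is not Mikado's real representation, and the containment tests are cruder than the actual class-code logic):

```python
# Toy sketch of the first two reduction approximations. A transcript
# is modelled here as ((start, end), introns), where introns is a
# tuple of (start, end) pairs sorted by position.

def chain_contained(chain_a, chain_b):
    """True if intron chain A is a contiguous sub-chain of chain B."""
    if not chain_a or len(chain_a) >= len(chain_b):
        return False
    width = len(chain_a)
    return any(chain_b[i:i + width] == chain_a
               for i in range(len(chain_b) - width + 1))

def span_contained(span_a, span_b):
    """True if span A lies completely within span B."""
    return span_b[0] <= span_a[0] and span_a[1] <= span_b[1]

def reduce_locus(transcripts):
    """Drop transcripts made redundant by another, still-kept transcript."""
    kept = set(transcripts)
    for tid in sorted(transcripts):
        span, chain = transcripts[tid]
        for other in kept - {tid}:
            ospan, ochain = transcripts[other]
            # First approximation: redundant intron chain;
            # second approximation: completely contained transcript.
            if chain_contained(chain, ochain) or (
                    span != ospan and span_contained(span, ospan)):
                kept.discard(tid)
                break
    return {tid: transcripts[tid] for tid in kept}

locus = {
    "t1": ((100, 900), ((200, 300), (400, 500))),
    "t2": ((100, 1200), ((200, 300), (400, 500), (600, 700))),
    "t3": ((150, 450), ()),  # monoexonic and contained in t2
}
print(sorted(reduce_locus(locus)))  # only 't2' survives
```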

Approximate clique finding

18 Feb 19:24

Introduced an approximate method of clique finding for complex loci (more than 350 transcripts). In such cases, Mikado will iteratively find the maximum clique, remove its nodes from the graph, and repeat until the graph is small enough (350 nodes or fewer) for the classic Bron-Kerbosch algorithm. The method is approximate but much faster than the previous implementation, while using only a fraction of the memory.
Complex loci such as these will be flagged as "approximate" at the superlocus level.
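A minimal sketch of the iterative strategy described above, using NetworkX's approximate max_clique for the large-graph phase and find_cliques (Bron-Kerbosch) once the graph is small enough; the threshold mirrors the release notes, but the rest is illustrative rather than Mikado's actual code:

```python
# Sketch: peel off (approximate) maximum cliques until the graph
# is small enough for exact Bron-Kerbosch enumeration.
import networkx as nx
from networkx.algorithms.approximation import max_clique

BK_THRESHOLD = 350  # size at which exact enumeration becomes viable

def peel_cliques(graph):
    graph = graph.copy()
    cliques = []
    while graph.number_of_nodes() > BK_THRESHOLD:
        clique = max_clique(graph)        # approximate maximum clique
        cliques.append(set(clique))
        graph.remove_nodes_from(clique)   # peel it off and repeat
    # The remaining graph is small: run classic Bron-Kerbosch on it
    cliques.extend(set(c) for c in nx.find_cliques(graph))
    return cliques
```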

Another change is that mikado serialise is now multi-threaded, an advantage that will be useful only when multiple BLAST files have been created.

New community finding

17 Feb 17:58

Main changes:

  • Switched to the [Reid/Daid/Hurley algorithm](http://arxiv.org/pdf/1205.0038.pdf) for community finding; much more efficient for complex regions.
  • Now compare does not penalize fusions in the refmap.

Lower underscore

15 Feb 16:34

Class codes of "_" now indicate a nucleotide F1 of 80%.
Serialise has been changed to make it leaner when reading the XML and FASTA files.
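For context, nucleotide F1 is the harmonic mean of nucleotide precision and recall; a small illustrative calculation (the numbers here are made up):

```python
# Nucleotide F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up example: 85% of predicted bases are correct (precision),
# ~76% of reference bases are recovered (recall) -> F1 of ~0.80
print(round(f1(0.85, 0.7555), 3))
```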

PyFAIDXing and yielding the reference

11 Feb 14:33

Changes to mikado prepare, which now uses generators and pyfaidx to make it faster and lighter.
Moreover, Mikado compare now returns a richer output in the statistics file.
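For reference, a minimal example of random access to a FASTA file with pyfaidx, which avoids loading the whole reference into memory (the file and sequence names are placeholders):

```python
# pyfaidx provides indexed, lazy access to FASTA sequences,
# so only the requested slices are read from disk.
from pyfaidx import Fasta

genome = Fasta("genome.fa")        # placeholder file name; indexed on first use
chunk = genome["Chr1"][999:1100]   # 0-based slice, read lazily from disk
print(chunk.seq[:10], len(chunk))
```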

New tags and badges

02 Feb 13:22

Greatest modification: changed the class codes, introducing "J" and "C" and modifying the meaning of "n". This could have repercussions on Pick as well.

Reverted to best bit score as the default BLAST scoring.

Bug fix in the calculation of metrics in sublocus.

Bug fix for SQL DELETE statements in the serialisation library.