Improvements to Mikado compare #166

lucventurini · 2019-04-02T09:11:38Z

Mikado compare has been undergoing a number of revisions for 1.5. Specifically:

- Added multiprocessing.
- Added the ability to analyse BAM predictions (e.g. Nanopore alignments).
- Internally, the Cython routine now compares C-level interval trees rather than Python-level sets, increasing speed and allowing for more refined matching. Specifically:
  - Added the possibility of specifying a "fuzzy match" for introns: an intron is considered a match if its boundaries are within a certain distance of those of the intron it is compared to.
- In the final statistics, Mikado compare now reports both redundant and non-redundant intron chain statistics.
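
To make the "fuzzy match" idea concrete, here is a minimal pure-Python sketch. It is not the Cython routine (which works on interval trees); the function names and the `fuzz` parameter are hypothetical and chosen only for illustration. It assumes introns are `(start, end)` pairs and that a chain matches when every intron, taken in order, matches within the tolerance.

```python
from typing import Sequence, Tuple

Intron = Tuple[int, int]  # (start, end) coordinates of one intron


def introns_match(query: Intron, reference: Intron, fuzz: int = 0) -> bool:
    """Return True if both boundaries of `query` lie within `fuzz` bases
    of the corresponding boundaries of `reference` (fuzz=0 means exact)."""
    return (abs(query[0] - reference[0]) <= fuzz
            and abs(query[1] - reference[1]) <= fuzz)


def chains_match(query: Sequence[Intron],
                 reference: Sequence[Intron],
                 fuzz: int = 0) -> bool:
    """Two intron chains match if they contain the same number of introns
    and each pair of introns, taken in order, matches within `fuzz`."""
    return (len(query) == len(reference)
            and all(introns_match(q, r, fuzz)
                    for q, r in zip(query, reference)))
```

With `fuzz=2`, for example, `(101, 200)` would match `(100, 202)` but not `(100, 203)`.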

However, the improvements are not finished yet. Specifically:

- Multiprocessing has increased the memory usage, most probably because data is held in memory before being returned to the main process. This has to be fixed before release.
- The "fuzzy matching" only works at the level of the single match, not at the level of the final statistics. This might cause confusion.
- As we added redundant intron-chain statistics, we should do the same at the intron level.
- We should add splice-level statistics as well.
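
As an illustration of the redundant versus non-redundant distinction, the sketch below assumes that "redundant" means every transcript contributes its own intron chain (or introns), while "non-redundant" collapses identical ones. This reading, the function names and the input layout are assumptions made for the example, not the actual Mikado code.

```python
from collections import Counter


def chain_counts(transcripts):
    """`transcripts` maps a transcript ID to a tuple of (start, end) introns.
    Returns (redundant, non_redundant) intron-chain counts.
    Monoexonic transcripts (empty chains) are skipped."""
    chains = Counter(chain for chain in transcripts.values() if chain)
    return sum(chains.values()), len(chains)


def intron_counts(transcripts):
    """Same idea one level down: count every intron occurrence (redundant)
    and every distinct intron (non-redundant)."""
    introns = Counter(intron
                      for chain in transcripts.values()
                      for intron in chain)
    return sum(introns.values()), len(introns)
```

Two transcripts sharing the chain `((100, 200), (300, 400))` would therefore add two to the redundant chain count but only one to the non-redundant count.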

swarbred · 2019-04-03T09:41:18Z

@lucventurini it looks like you added the redundant / non-redundant option for introns, and I see you made some additional changes yesterday. Is this in a state we should test/use, or should we hold off for now, i.e. if you are planning to make further changes shortly?

lucventurini · 2019-04-03T11:13:55Z

I am still making changes, please hold off for now. I need to verify that the multiprocessing is implemented correctly.

I will let you know when it's ready to be trialled!

…-loading the index into memory (#166).

lucventurini · 2019-04-03T17:57:24Z

After verifying with junctools, it is apparent that the junction statistics need to be thoroughly investigated. This takes priority.

…ch global statistics are still wrong.

lucventurini · 2019-04-04T18:58:01Z

Hi @swarbred, the latest commit in issue-166 should be ready for testing. The only outstanding part is making sure that the statistics for the introns are not taking into account the fuzziness.

Memory usage should be much lower now, as I make full use of the SQLite database rather than preloading the full index into memory. This is particularly important for multiprocessing.
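
The general pattern can be sketched as follows: each worker opens its own connection to the on-disk SQLite file and fetches only the rows it needs, instead of receiving a copy of the whole index. The table layout (`genes`, `start_pos`, `end_pos`) and all function names are hypothetical and do not reflect the real index schema.

```python
import multiprocessing
import sqlite3


def build_index(db_path, genes):
    """Write (gene_id, chrom, start, end) tuples to a toy SQLite index."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS genes (gene_id TEXT PRIMARY KEY, "
                "chrom TEXT, start_pos INTEGER, end_pos INTEGER)")
            conn.execute(
                "CREATE INDEX IF NOT EXISTS genes_pos "
                "ON genes (chrom, start_pos, end_pos)")
            conn.executemany(
                "INSERT OR REPLACE INTO genes VALUES (?, ?, ?, ?)", genes)
    finally:
        conn.close()


def overlapping_genes(db_path, chrom, start, end):
    """Open a connection in the calling process and fetch only the IDs of
    genes overlapping chrom:start-end; nothing else is loaded in memory."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT gene_id FROM genes "
            "WHERE chrom = ? AND start_pos <= ? AND end_pos >= ? "
            "ORDER BY start_pos, gene_id",
            (chrom, end, start)).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


def query_regions(db_path, regions, procs=2):
    """Query many (chrom, start, end) regions in parallel. Only the database
    path is sent to the workers, never the index itself. On platforms that
    spawn processes, call this from under `if __name__ == "__main__":`."""
    with multiprocessing.Pool(procs) as pool:
        return pool.starmap(overlapping_genes,
                            [(db_path, *region) for region in regions])
```

Because only a path string crosses the process boundary, startup cost and per-process memory should stay roughly independent of the index size.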

Please test and have a look at the results!

lucventurini · 2019-04-05T18:46:08Z

Today's commits somewhat modify the Cython classes and utilities that underpin Mikado compare. At the moment the laundry list looks like this:

- Multiprocessing has increased the memory usage, most probably because data is held in memory before being returned to the main process. This has to be fixed before release.
  - This should be fixed.
- The "fuzzy matching" only works at the level of the single match, not at the level of the final statistics. This might cause confusion.
  - In progress. Today I was able to implement this for the intron chains.
- As we added redundant intron-chain statistics, we should do the same at the intron level.
  - In progress.
- We should add splice-level statistics as well.
  - In progress.

lucventurini · 2019-04-17T06:58:36Z

A problem with multiprocessing is the large amount of memory required and the long startup time. This in turn is due to preloading the index into memory for each subprocess.

This strategy is quite slow and wasteful, especially for large indices. A better method is required.

…data in the children processes

…olishing for compare (#166)

… compare when testing in the all vs all (issue #166).

lucventurini · 2019-05-28T12:51:13Z

Currently, the only outstanding issue is the following:

- The "fuzzy matching" only works at the level of the single match, not at the level of the final statistics. This might cause confusion.

@cschuh

* Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests.
* Fixed previous commit
* Fixed travis bug
* Refactoring of check_index for Mikado compare (#166) and fix for #172
* Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover
* This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (#137) potentially also fixing #172.
* Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs
* Fixed previous breakage
* Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well.
* Minor edit to assigner
* Fixing previously broken commit
* Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless.
* Adding the GZI index to the tests directory to avoid permission errors. Addressing #175
* Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output)
* Adding a maximum intron length for the default scoring configuration files.
* BROKEN. Proceeding on #142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way.
* Issue #174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh.
* #174: this should provide a solution to the issue, which is however only temporary. To be tested.
* #174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better.
* #174: peppered the failing block with try-except statements.
* #174: this should solve it. Now missing external scores in the database will cause Mikado to explicitly fail.
* Fixed #176
* BROKEN. Progress on #142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.**
* Closing #155.
* #174: Now Mikado pick will die informatively if the SQLite3 database has not been found.
* #166: fixed some issues with self-compare
* BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress.
* The padding now should be tested and correct.
* Fixed previous commit. This should fix #142.

lucventurini · 2019-06-06T18:14:58Z

Moving the last point to the next release. Closing this issue.

@cschuh

* Solved a small bug in the Gene class
* This commit should fix some of the performance issues found in Mikado compare when testing in the all vs all (issue #166).
* Updated the CHANGELOG.
* Slight improvements to the generic GFLine class and to the to_gff wrapper
* Solved some assorted bugs, from stop_codon parsing in GTF2 (for Augustus) to avoiding a very costly pragma check on MIDX databases.
* Now Mikado util stats will only return one value for the mode, making the table parsable
* Solved some small bugs introduced by changing the mode for mikado util stats
* Dropping automated support for Python3.5. The conda environment cannot be created successfully, too many packages have not been updated in the original repositories.
* Updating the conda environment to reflect that only Python>=3.6 is now accepted
* Various fixes for managing correctly BED12 files.
* Fix for the previous commit breaking TRAVIS
* Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests.
* Fixed previous commit
* Fixed travis bug
* Refactoring of check_index for Mikado compare (#166) and fix for #172
* Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover
* This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (#137) potentially also fixing #172.
* Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs
* Fixed previous breakage
* Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well.
* Minor edit to assigner
* Fixing previously broken commit
* Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless.
* Adding the GZI index to the tests directory to avoid permission errors. Addressing #175
* Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output)
* Adding a maximum intron length for the default scoring configuration files.
* BROKEN. Proceeding on #142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way.
* Issue #174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh.
* #174: this should provide a solution to the issue, which is however only temporary. To be tested.
* #174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better.
* #174: peppered the failing block with try-except statements.
* #174: this should solve it. Now missing external scores in the database will cause Mikado to explicitly fail.
* Fixed #176
* BROKEN. Progress on #142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.**
* Closing #155.
* #174: Now Mikado pick will die informatively if the SQLite3 database has not been found.
* #166: fixed some issues with self-compare
* BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress.
* The padding now should be tested and correct.
* Fixed previous commit. This should fix #142.
* Development (#178)
* Update Singularity.centos.def Changed python to python3 during %post, otherwise it will use the system python2.7...
* Fixed small bug in external metrics handling
* Update Singularity.centos.def
* Development (#184)
* This should address #173 (both configuration file and docs) and #158
* Fix #181 and small bug fix for parsing Mikado annotations.
* Progress for #142 - this should fix the wrong ORF calculation for cases when the CDS was open at the 5' end.
* Fixed previous commit (always for #142)
* #142: corrected and tested the issue with one-off exons, for padding.
* This should fix and test #142 for good.
* Removed spurious warning/error messages
* #142: solved a bug which caused truncated transcripts at the 5' end not to be padded.
* #142: solved a problem which caused a false abort for transcripts on the - strand with changed stop codon.
* #142: fixing previous commit
* Pushing the fix for #182 onto the development branch
* Fix #183
* Fix #183 and previous commit
* #183: now Mikado configure will set a seed when generating the configuration file. The seed will be explicitly mentioned in the log.
* #177: made ORF loading slightly faster with pysam. Also made XML serialisation much faster using SQL sessions and multiprocessing.Pool instead of queues.
* Solved annoying bug that caused Mikado to crash with TAIR GFF3s.

@cschuh

* Solved a small bug in the Gene class * This commit should fix some of the performance issues found in Mikado compare when testing in the all vs all (issue EI-CoreBioinformatics#166). * Updated the CHANGELOG. * Slight improvements to the generic GFLine class and to the to_gff wrapper * Solved some assorted bugs, from stop_codon parsing in GTF2 (for Augustus) to avoiding a very costly pragma check on MIDX databases. * Now Mikado util stats will only return one value for the mode, making the table parsable * Solved some small bugs introduced by changing the mode for mikado util stats * Dropping automated support for Python3.5. The conda environment cannot be created successfully, too many packages have not been updated in the original repositories. * Updating the conda environment to reflect that only Python>=3.6 is now accepted * Various fixes for managing correctly BED12 files. * Fix for the previous commit breaking TRAVIS * Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests. * Fixed previous commit * Fixed travis bug * Refactoring of check_index for Mikado compare (EI-CoreBioinformatics#166) and fix for EI-CoreBioinformatics#172 * Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover * This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (EI-CoreBioinformatics#137) potentially also fixing EI-CoreBioinformatics#172. * Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs * Fixed previous breakage * Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well. * Minor edit to assigner * Fixing previously broken commit * Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless. * Adding the GZI index to the tests directory to avoid permission errors. 
Addressing EI-CoreBioinformatics#175 * Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output) * Adding a maximum intron length for the default scoring configuration files. * BROKEN. Proceeding on EI-CoreBioinformatics#142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way. * Issue EI-CoreBioinformatics#174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh. * EI-CoreBioinformatics#174: this should provide a solution to the issue, which is however only temporary. To be tested. * EI-CoreBioinformatics#174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better. * EI-CoreBioinformatics#174: peppered the failing block with try-except statements. * EI-CoreBioinformatics#174: this should solve it. Now missing external scores in the database will cause Mikado to explicitly fail. * Fixed EI-CoreBioinformatics#176 * BROKEN. Progress on EI-CoreBioinformatics#142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.** * Closing EI-CoreBioinformatics#155. * EI-CoreBioinformatics#174: Now Mikado pick will die informatively if the SQLite3 database has not been found. * EI-CoreBioinformatics#166: fixed some issues with self-compare * BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress. * The padding now should be tested and correct. * Fixed previous commit. This should fix EI-CoreBioinformatics#142. * Development (EI-CoreBioinformatics#178) * Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests. 
* Fixed previous commit * Fixed travis bug * Refactoring of check_index for Mikado compare (EI-CoreBioinformatics#166) and fix for EI-CoreBioinformatics#172 * Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover * This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (EI-CoreBioinformatics#137) potentially also fixing EI-CoreBioinformatics#172. * Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs * Fixed previous breakage * Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well. * Minor edit to assigner * Fixing previously broken commit * Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless. * Adding the GZI index to the tests directory to avoid permission errors. Addressing EI-CoreBioinformatics#175 * Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output) * Adding a maximum intron length for the default scoring configuration files. * BROKEN. Proceeding on EI-CoreBioinformatics#142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way. * Issue EI-CoreBioinformatics#174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh. * EI-CoreBioinformatics#174: this should provide a solution to the issue, which is however only temporary. To be tested. * EI-CoreBioinformatics#174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better. * EI-CoreBioinformatics#174: peppered the failing block with try-except statements. * EI-CoreBioinformatics#174: this should solve it. 
Now missing external scores in the database will cause Mikado to explicitly fail. * Fixed EI-CoreBioinformatics#176 * BROKEN. Progress on EI-CoreBioinformatics#142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.** * Closing EI-CoreBioinformatics#155. * EI-CoreBioinformatics#174: Now Mikado pick will die informatively if the SQLite3 database has not been found. * EI-CoreBioinformatics#166: fixed some issues with self-compare * BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress. * The padding now should be tested and correct. * Fixed previous commit. This should fix EI-CoreBioinformatics#142. * Update Singularity.centos.def Changed python to python3 during %post, otherwise it will use the system python2.7... * Fixed small bug in external metrics handling * Update Singularity.centos.def * Development (EI-CoreBioinformatics#184) * This should address EI-CoreBioinformatics#173 (both configuration file and docs) and EI-CoreBioinformatics#158 * Fix EI-CoreBioinformatics#181 and small bug fix for parsing Mikado annotations. * Progress for EI-CoreBioinformatics#142 - this should fix the wrong ORF calculation for cases when the CDS was open at the 5' end. * Fixed previous commit (always for EI-CoreBioinformatics#142) * EI-CoreBioinformatics#142: corrected and tested the issue with one-off exons, for padding. * This should fix and test EI-CoreBioinformatics#142 for good. * Removed spurious warning/error messages * EI-CoreBioinformatics#142: solved a bug which caused truncated transcripts at the 5' end not to be padded. * EI-CoreBioinformatics#142: solved a problem which caused a false abort for transcripts on the - strand with changed stop codon. 
* EI-CoreBioinformatics#142: fixing previous commit * Pushing the fix for EI-CoreBioinformatics#182 onto the development branch * Fix EI-CoreBioinformatics#183 * Fix EI-CoreBioinformatics#183 and previous commit * EI-CoreBioinformatics#183: now Mikado configure will set a seed when generating the configuration file. The seed will be explicitly mentioned in the log. * EI-CoreBioinformatics#177: made ORF loading slightly faster with pysam. Also made XML serialisation much faster using SQL sessions and multiprocessing.Pool instead of queues. * Solved annoying bug that caused Mikado to crash with TAIR GFF3s.

@cschuh

* Solved a small bug in the Gene class
* This commit should fix some of the performance issues found in Mikado compare when testing in the all vs all (issue EI-CoreBioinformatics#166).
* Updated the CHANGELOG.
* Slight improvements to the generic GFLine class and to the to_gff wrapper
* Solved some assorted bugs, from stop_codon parsing in GTF2 (for Augustus) to avoiding a very costly pragma check on MIDX databases.
* Now Mikado util stats will only return one value for the mode, making the table parsable
* Solved some small bugs introduced by changing the mode for mikado util stats
* Dropping automated support for Python3.5. The conda environment cannot be created successfully, too many packages have not been updated in the original repositories.
* Updating the conda environment to reflect that only Python>=3.6 is now accepted
* Various fixes for managing correctly BED12 files.
* Fix for the previous commit breaking TRAVIS
* Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests.
* Fixed previous commit
* Fixed travis bug
* Refactoring of check_index for Mikado compare (EI-CoreBioinformatics#166) and fix for EI-CoreBioinformatics#172
* Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover
* This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (EI-CoreBioinformatics#137) potentially also fixing EI-CoreBioinformatics#172.
* Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs
* Fixed previous breakage
* Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well.
* Minor edit to assigner
* Fixing previously broken commit
* Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless.
* Adding the GZI index to the tests directory to avoid permission errors. Addressing EI-CoreBioinformatics#175
* Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output)
* Adding a maximum intron length for the default scoring configuration files.
* BROKEN. Proceeding on EI-CoreBioinformatics#142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way.
* Issue EI-CoreBioinformatics#174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh.
* EI-CoreBioinformatics#174: this should provide a solution to the issue, which is however only temporary. To be tested.
* EI-CoreBioinformatics#174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better.
* EI-CoreBioinformatics#174: peppered the failing block with try-except statements.
* EI-CoreBioinformatics#174: this should solve it. Now missing external scores in the database will cause Mikado to explicitly fail.
* Fixed EI-CoreBioinformatics#176
* BROKEN. Progress on EI-CoreBioinformatics#142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.**
* Closing EI-CoreBioinformatics#155.
* EI-CoreBioinformatics#174: Now Mikado pick will die informatively if the SQLite3 database has not been found.
* EI-CoreBioinformatics#166: fixed some issues with self-compare
* BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress.
* The padding now should be tested and correct.
* Fixed previous commit. This should fix EI-CoreBioinformatics#142.
* Development (EI-CoreBioinformatics#178)
* Update Singularity.centos.def Changed python to python3 during %post, otherwise it will use the system python2.7...
* Fixed small bug in external metrics handling
* Update Singularity.centos.def
* This should address EI-CoreBioinformatics#173 (both configuration file and docs) and EI-CoreBioinformatics#158
* Fix EI-CoreBioinformatics#181 and small bug fix for parsing Mikado annotations.
* Progress for EI-CoreBioinformatics#142 - this should fix the wrong ORF calculation for cases when the CDS was open at the 5' end.
* Fixed previous commit (always for EI-CoreBioinformatics#142)
* EI-CoreBioinformatics#142: corrected and tested the issue with one-off exons, for padding.
* This should fix and test EI-CoreBioinformatics#142 for good.
* Removed spurious warning/error messages
* EI-CoreBioinformatics#142: solved a bug which caused truncated transcripts at the 5' end not to be padded.
* EI-CoreBioinformatics#142: solved a problem which caused a false abort for transcripts on the - strand with changed stop codon.
* EI-CoreBioinformatics#142: fixing previous commit
* Pushing the fix for EI-CoreBioinformatics#182 onto the development branch
* Fix EI-CoreBioinformatics#183
* Fix EI-CoreBioinformatics#183 and previous commit
* EI-CoreBioinformatics#183: now Mikado configure will set a seed when generating the configuration file. The seed will be explicitly mentioned in the log.
* EI-CoreBioinformatics#177: made ORF loading slightly faster with pysam. Also made XML serialisation much faster using SQL sessions and multiprocessing.Pool instead of queues.
* Solved annoying bug that caused Mikado to crash with TAIR GFF3s.
* Development (EI-CoreBioinformatics#184)

lucventurini added the bug and enhancement labels Apr 2, 2019

lucventurini added this to the 1.5 milestone Apr 2, 2019

lucventurini assigned swarbred, gemygk and lucventurini Apr 2, 2019

lucventurini added a commit that referenced this issue Apr 2, 2019

Now Mikado compare tracks non-redundant statistics as well (#166)

612b035

lucventurini added a commit that referenced this issue Apr 2, 2019

Now Mikado compare uses SQLite dumps. This should reduce memory usage (#166)

b551846

lucventurini added a commit that referenced this issue Apr 2, 2019

Fixed some minor kinks (#166)

c885b30

lucventurini added a commit that referenced this issue Apr 3, 2019

Now Mikado compare seems to function correctly without completely pre-loading the index into memory (#166).

1b51645

lucventurini added a commit that referenced this issue Apr 4, 2019

Now mikado compare functions correctly (#166). However, the fuzzy-match global statistics are still wrong.

60f56ef

lucventurini added a commit that referenced this issue Apr 17, 2019

This should solve the memory problem for #166 by avoiding to preload data in the children processes

20e7013

lucventurini added a commit that referenced this issue Apr 18, 2019

Switched to msgpack for compare (#168) and various improvements and polishing for compare (#166)

be715ec

lucventurini added a commit that referenced this issue Apr 25, 2019

This commit should fix some of the performance issues found in Mikado compare when testing in the all vs all (issue #166).

c5d2a3e

lucventurini mentioned this issue Apr 30, 2019

This commit should fix some of the performance issues found in Mikado… #170

Merged

lucventurini closed this as completed Jun 6, 2019

lucventurini added this to Closed in Version 2 Oct 15, 2020

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Now Mikado compare tracks non-redundant statistics as well (EI-CoreBioinformatics#166)

ac66ba2

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Now Mikado compare uses SQLite dumps. This should reduce memory usage (EI-CoreBioinformatics#166)

206572a

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Fixed some minor kinks (EI-CoreBioinformatics#166)

1b77839

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Now Mikado compare seems to function correctly without completely pre-loading the index into memory (EI-CoreBioinformatics#166).

3ddbbde

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Now mikado compare functions correctly (EI-CoreBioinformatics#166). However, the fuzzy-match global statistics are still wrong.

5a11f37

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

This should solve the memory problem for EI-CoreBioinformatics#166 by avoiding to preload data in the children processes

0a04635

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Switched to msgpack for compare (EI-CoreBioinformatics#168) and various improvements and polishing for compare (EI-CoreBioinformatics#166)

cfbbd3c

