Smarter transcript padding #142

Closed
lucventurini opened this issue Nov 5, 2018 · 13 comments

@lucventurini
Collaborator

At the moment Mikado performs a relatively naive expansion: it checks whether the end of the other transcript is downstream of the one we are observing and, if it is, it simply expands the transcript up to that end. However, this means crossing every intron that lies downstream, which by definition results in a transcript with one or more retained introns.

I can think of four broad cases, all of which we currently treat in the same way.
Notation: "A" is the transcript to be expanded, "B" the template. For simplicity, they are both on the "+" strand, and we are expanding their 3' end. The reasoning is analogous for any of the other three orientations ("+" at the 5' end, etc.).

  • Exon vs exon. The last exon of transcript A ends within the last exon of transcript B.
    • Current algorithm: expand until end of B.
    • Proposed change: None.
  • Terminal exon vs. terminal intron. The last exon of transcript A ends within the last intron of transcript B.
    • Current algorithm: expand until end of B.
    • Proposed change: None.
  • Terminal exon vs. non-terminal exon. The last exon of transcript A ends within an internal exon of B.
    • Current algorithm: expand the last exon until end of B.
    • Proposed change: expand the last exon of A until its end is the same as the one of the overlapping exon. Add all the remaining exons of B.
  • Terminal exon vs non-terminal intron. The last exon of transcript A ends within an internal intron of B.
    • Current algorithm: expand the last exon until end of B.
    • Proposed change: expand the last exon of A until the end of the exon following the intron of B. Then, add the remaining exons to A.
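
A minimal sketch of the proposed behaviour for the four cases, for illustration only. It assumes exons are plain `(start, end)` tuples with inclusive coordinates on the "+" strand; `pad_three_prime` is a hypothetical name, not Mikado's actual function. Note that one rule covers all four cases: extend the last exon of A to the end of the first exon of B that ends at or after the end of A, then append the remaining exons of B.

```python
def pad_three_prime(a_exons, b_exons):
    """Pad the 3' end of A using B as a template ("+" strand only).

    Both arguments are lists of (start, end) tuples, inclusive coordinates.
    Hypothetical helper: a sketch of the proposal, not Mikado's code.
    """
    a_exons, b_exons = sorted(a_exons), sorted(b_exons)
    last_start, a_end = a_exons[-1]
    if b_exons[-1][1] <= a_end:
        # B does not extend downstream of A: nothing to pad.
        return a_exons
    # First exon of B ending at or after the end of A. The end of A lies
    # either inside this exon or in the intron immediately preceding it.
    index = next(i for i, (_, b_end) in enumerate(b_exons) if a_end <= b_end)
    extended_last = (last_start, b_exons[index][1])
    # Keep A's upstream exons, extend its last exon, add what is left of B.
    return a_exons[:-1] + [extended_last] + b_exons[index + 1:]
```

When the end of A falls in the terminal exon or terminal intron of B, no exons remain to be added and the result is the same as the current "expand until end of B" behaviour.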
@lucventurini
Collaborator Author

The functionality is in place; however, I still need to add some tests to make sure that we have covered at least all the basic cases.

@lucventurini
Collaborator Author

Mikado now uses the internal interval tree of exon/intron segments to find all the overlapping segments. This should ensure that we can deal with all cases.
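
As a rough illustration of the lookup described above, the sketch below builds the exon/intron segments of a template and returns those overlapping a query interval. It uses a plain linear scan as a stand-in for the interval tree, and both function names are hypothetical; Mikado's internal tree class is not reproduced here.

```python
def segments_from_exons(exons):
    """Return sorted (start, end, kind) tuples for the exons and the introns
    between them. Coordinates are inclusive."""
    exons = sorted(exons)
    segments = []
    for index, (start, end) in enumerate(exons):
        if index > 0:
            # The intron spans the gap between the previous exon and this one.
            segments.append((exons[index - 1][1] + 1, start - 1, "intron"))
        segments.append((start, end, "exon"))
    return segments


def find_overlapping(segments, start, end):
    """Linear-scan stand-in for an interval tree query."""
    return [seg for seg in segments if seg[0] <= end and seg[1] >= start]
```

An interval tree answers the same query in logarithmic rather than linear time, which matters when a locus contains many transcripts.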

@lucventurini
Collaborator Author

The procedure is now as follows:

  1. After adding all the putative transcript events, check which ones have a score over the threshold, and keep only the first N transcripts that pass this requirement (N=max_isoforms).
  2. If we had to discard transcripts, recalculate metrics and scores, and recheck.
  3. Check for retained introns. Depending on the settings, any transcript with retained introns will either be flagged or removed.
  4. If they are removed, recalculate metrics and scores, and restart from 1.
  5. If we have to execute the padding:
    a. Create a copy of the transcripts in the locus.
    b. Calculate the padded transcripts, keeping track of those we used as templates for the padding.
    c. Recalculate metrics and scores.
    d. Check whether we have created non-valid transcripts by padding:
      • If the invalid transcripts have not been used as templates, discard them and continue the procedure.
      • If they have, or we have made the primary transcript invalid: discard the invalid templates and restart.
    e. Using the score to derive the insertion order, check whether the modified transcripts are still valid ASEs.
    f. Check whether any of the retained transcripts now has a retained intron.
    g. If step e or f finds invalid transcripts, repeat point d, i.e.:
      • If the invalid transcripts have not been used as templates, discard them and continue the procedure.
      • If they have, or we have made the primary transcript invalid: discard the invalid templates and restart.

This involved procedure has multiple fail-safes and should ensure that:

  • the primary transcript is never made invalid;
  • no transcript is kept as an ASE if padding turns it into an invalid ASE;
  • no transcript is padded according to the structure of a transcript we ended up discarding.
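
The pad, check and restart loop of the padding step can be sketched as follows. This is a toy under stated assumptions, not Mikado's implementation: `pad_with_checks`, `pad` and `is_valid` are hypothetical names, the ASE and retained-intron checks are folded into the single `is_valid` callback, and "discard the invalid templates" is interpreted as also dropping the template that made the primary transcript invalid.

```python
def pad_with_checks(transcripts, primary, pad, is_valid):
    """Pad a locus, discarding whatever padding made invalid.

    transcripts: dict mapping transcript name -> model (any object).
    pad(models): returns (padded_models, template_of), where template_of
        maps each padded transcript name to the name of its template.
    is_valid(name, model): True if the (padded) model is acceptable.
    """
    current = dict(transcripts)  # work on a copy of the locus
    while True:
        padded, template_of = pad(current)
        invalid = {n for n, model in padded.items() if not is_valid(n, model)}
        # Invalid transcripts that were used as templates force a restart.
        to_drop = invalid & set(template_of.values())
        if primary in invalid and primary in template_of:
            # Assumption: blame the template that broke the primary.
            to_drop.add(template_of[primary])
        # Never discard the primary itself; only drop what is still present.
        to_drop = {n for n in to_drop if n != primary and n in current}
        if not to_drop:
            # Invalid transcripts were not templates: discard and continue.
            return {n: m for n, m in padded.items() if n not in invalid}
        for name in to_drop:
            del current[name]
```

Each restart removes at least one transcript from the working copy, so the loop terminates as long as `pad` only names templates that are present in its input.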

@lucventurini
Collaborator Author

Solved after confirmation by @swarbred

@lucventurini
Collaborator Author

As noted by @gemygk and @swarbred:

  • "ts_max_distance" should refer to the cDNA distance, not the genomic distance.
  • "Reference" transcripts are not treated in any special way by the AS machinery. That means that they will always pass the requirements checks, but they might not pass the requirements for ASEs. This should be made clear in the documentation.

@lucventurini
Collaborator Author

As an addendum, transcripts should not be expanded if the boundary of the expandable transcript ends within an intron. In these cases both expansion options (i.e. creating a false intron or creating a massive exon) are undesirable, so we should disable this.
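
The check described above could look like the sketch below, which only illustrates the rule. `boundary_in_intron` is a hypothetical helper, and exons are assumed to be `(start, end)` tuples with inclusive coordinates.

```python
def boundary_in_intron(boundary, template_exons):
    """Return True if `boundary` falls strictly inside an intron of the
    template, i.e. between two consecutive exons."""
    exons = sorted(template_exons)
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        if left_end < boundary < right_start:
            return True
    return False
```

A caller would skip the template (or the expansion altogether) whenever this returns True for the end of the transcript being padded.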

@lucventurini
Collaborator Author

lucventurini commented May 20, 2019

@swarbred @gemygk

Refining the padding: a complex case

I am revising the algorithm for the padding. I have already added the part that makes Mikado aware of where a transcript ends (see 0b64818). The problem is that there are ambiguous cases that need to be handled in a deterministic manner. Specifically:

t1:  |===|-----|====|--|====|----|====|
t2:  |===|-------------|====|----|=====|--------|==|
t3:  |===|-------------|====|----|=========|---------|====|
t4:  |===|-------------|====|----|=======|
t5:  |===|-------------|====|----|==|--------|===|
t6:  |===|--------|=======|----------------|====|

In this case:

  • T1 is the only expandable transcript: all the others are mutually incompatible
  • T1 is not compatible with T6 (it would mean extending an exon which is completely internal to an intron of the template)
  • T1 is fully compatible with T2, T3, T4
  • T1 might be compatible with T5. This depends on whether we accept adding an intron retention event (the last exon of T1 starts within the second-to-last exon of T5 and terminates within the last intron). How do we feel about this?
  • T2, T3, T4, T5 and T6 are all mutually incompatible in terms of expansion. T1 can be expanded according to the template of one and only one of the other transcripts.

Shifting to directional graphs

The way to break the conundrum:

  • store the relationship between the paddable transcripts in a directional graph.
  • store not only the direction (e.g. T1 could be expanded to T2) but also the distance that would need to be filled. This should take into account both the number of introns and the genomic distance.

So in our example the best choices would probably be, in order:

  • T4 (long exon elongation, no additional splicing)
  • T2 (short exon elongation, additional splicing event)
  • T3 (long exon elongation, additional splicing event)
  • T5 (?; long exon/intron extension).

The final algorithm should therefore:

  • link together T1 to all the valid alternatives
  • recognise T2, T3, T4 (,T5?) as multiple and incompatible "end points" of the path
  • prioritise each of the links according to the distance metric
  • choose one of the extensions, discard the rest
  • potentially we might want to backtrack if the extension becomes invalid. This however would further complicate the algorithm and require more development time.
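
A small sketch of the directional graph described above, for illustration only. `PaddingGraph` is a hypothetical class built on plain dictionaries, the edge weights are invented numbers, and ranking by number of added introns first and genomic distance second is an assumption chosen to reproduce the order T4, T2, T3, T5 given in the example.

```python
from collections import defaultdict


class PaddingGraph:
    """Directed graph: an edge A -> B means "A could be expanded to B"."""

    def __init__(self):
        # source -> {template: (new_introns, genomic_distance)}
        self._edges = defaultdict(dict)

    def add_edge(self, source, template, new_introns, genomic_distance):
        self._edges[source][template] = (new_introns, genomic_distance)

    def ranked_templates(self, source):
        """Templates for `source`, best first. The name is used as the last
        sort key so that ties are broken deterministically."""
        edges = self._edges.get(source, {})
        return sorted(edges, key=lambda name: edges[name] + (name,))


graph = PaddingGraph()
# Hypothetical weights for the t1 example: (added introns, distance in bp).
graph.add_edge("t1", "t2", new_introns=1, genomic_distance=10)
graph.add_edge("t1", "t3", new_introns=1, genomic_distance=40)
graph.add_edge("t1", "t4", new_introns=0, genomic_distance=30)
graph.add_edge("t1", "t5", new_introns=1, genomic_distance=60)
print(graph.ranked_templates("t1"))
```

Because only the edge t1 -> t4 is stored (and not t4 -> t1), T2 to T6 have no outgoing edges and are never considered expandable.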

@swarbred
Collaborator

@lucventurini the alternative would be that, where multiple compatible transcripts could be used for the extension, you first eliminate the options that would not meet the ts_max_distance and ts_max_splices requirements and then, of the remaining ones, take the highest-scoring transcript. If there is a tie, I would be fine with any way of splitting it.

If I were manually annotating your example I would merge t1 into the "best" of the alternative compatible models, i.e. the one which gave the longest CDS or had the most support from evidence. Merging into the highest-scoring compatible transcript would probably most closely reproduce my choice.
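
The selection rule suggested above can be sketched as follows. `ts_max_distance` and `ts_max_splices` are the parameter names used in this thread; the `Candidate` record, the `choose_template` function and all the numbers are hypothetical, and the tie-break on the transcript name is just one arbitrary but deterministic choice.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    score: float
    cdna_distance: int  # distance to fill, in cDNA terms
    splices: int        # splice junctions the padding would add


def choose_template(candidates: Sequence[Candidate],
                    ts_max_distance: int,
                    ts_max_splices: int) -> Optional[str]:
    """Drop templates exceeding either limit, then pick the highest score."""
    allowed = [c for c in candidates
               if c.cdna_distance <= ts_max_distance
               and c.splices <= ts_max_splices]
    if not allowed:
        return None
    # Ties on the score are split by name, purely for determinism.
    return max(allowed, key=lambda c: (c.score, c.name)).name
```

For instance, with a limit of 500 bp and one splice, a candidate scoring 15 but requiring 900 bp would be filtered out before the scores are compared.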

How are we currently dealing with this? What you are suggesting sounds like a substantial change.

@lucventurini
Collaborator Author

Hi @swarbred,
currently we are dealing with this in a suboptimal way, which basically amounts to a random choice. Moreover, as I was storing only the connection between two transcripts (so t1 <=> t4) and not its direction (e.g. t1 => t4), I ended up with a hodgepodge. This was fine when the relationship was very linear (i.e. only expanding based on genomic coordinates), but it is inefficient and breaks when shifting to the more sophisticated version of padding we are trying to implement.

Your suggestion of using the score of the transcript as our metric is extremely sensible, though; I will implement it as soon as I can.

@lucventurini
Collaborator Author

Hi @swarbred, @gemygk, after e1b204d the padding should now be fixed.
As written above, ties will now be decided by the scoring.
Although I have tried to test it properly within the test suite, the best way will be to try it out on real data.

lucventurini added a commit that referenced this issue Jun 5, 2019
* Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests.

* Fixed previous commit

* Fixed travis bug

* Refactoring of check_index for Mikado compare (#166) and fix for #172

* Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover

* This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (#137) potentially also fixing #172.

* Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs

* Fixed previous breakage

* Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well.

* Minor edit to assigner

* Fixing previously broken commit

* Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless.

* Adding the GZI index to the tests directory to avoid permission errors. Addressing #175

* Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output)

* Adding a maximum intron length for the default scoring configuration files.

* BROKEN. Proceeding on #142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way.

* Issue #174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh.

* #174: this should provide a solution to the issue, which is however only temporary. To be tested.

* #174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better.

* #174: peppered the failing block with try-except statements.

* #174: this should solve it. Now missing external scores in the database will cause Mikado to explicitly fail.

* Fixed #176

* BROKEN. Progress on #142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.**

* Closing #155.

* #174: Now Mikado pick will die informatively if the SQLite3 database has not been found.

* #166: fixed some issues with self-compare

* BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress.

* The padding now should be tested and correct.

* Fixed previous commit. This should fix #142.
@lucventurini
Collaborator Author

Currently the CDS padding is broken. To be fixed ASAP.

lucventurini added a commit that referenced this issue Jun 18, 2019
* This should address #173 (both configuration file and docs) and #158

* Fix #181 and small bug fix for parsing Mikado annotations.

* Progress for #142 - this should fix the wrong ORF calculation for cases when the CDS was open at the 5' end.

* Fixed previous commit (always for #142)

* #142: corrected and tested the issue with one-off exons, for padding.

* This should fix and test #142 for good.

* Removed spurious warning/error messages

* #142: solved a bug which caused truncated transcripts at the 5' end not to be padded.

* #142: solved a problem which caused a false abort for transcripts on the - strand with changed stop codon.

* #142: fixing previous commit

* Pushing the fix for #182 onto the development branch

* Fix #183

* Fix #183 and previous commit

* #183: now Mikado configure will set a seed when generating the configuration file. The seed will be explicitly mentioned in the log.

* #177: made ORF loading slightly faster with pysam. Also made XML serialisation much faster using SQL sessions and multiprocessing.Pool instead of queues.

* Solved annoying bug that caused Mikado to crash with TAIR GFF3s.
lucventurini added a commit that referenced this issue Jun 19, 2019
* Solved a small bug in the Gene class

* This commit should fix some of the performance issues found in Mikado compare when testing in the all vs all (issue #166).

* Updated the CHANGELOG.

* Slight improvements to the generic GFLine class and to the to_gff wrapper

* Solved some assorted bugs, from stop_codon parsing in GTF2 (for Augustus) to avoiding a very costly pragma check on MIDX databases.

* Now Mikado util stats will only return one value for the mode, making the table parsable

* Solved some small bugs introduced by changing the mode for mikado util stats

* Dropping automated support for Python3.5. The conda environment cannot be created successfully, too many packages have not been updated in the original repositories.

* Updating the conda environment to reflect that only Python>=3.6 is now accepted

* Various fixes for managing correctly BED12 files.

* Fix for the previous commit breaking TRAVIS

* Development (#178)

* Update Singularity.centos.def

Changed python to python3 during %post, otherwise it will use the system python2.7...

* Fixed small bug in external metrics handling

* Update Singularity.centos.def

* Development (#184)

@lucventurini lucventurini reopened this Jun 19, 2019
@lucventurini
Collaborator Author

Hi @gemygk, @swarbred, @cschu, am I correct in saying that you have not found any new errors in the latest runs?
If that is the case, we might close this issue.

@lucventurini
Collaborator Author

Fixed as the current status.

@lucventurini lucventurini added this to Closed in Version 2 Oct 15, 2020
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
…king for the interface with introns, not just exons

* Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output)

* Adding a maximum intron length for the default scoring configuration files.

* BROKEN. Proceeding on EI-CoreBioinformatics#142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way.

* Issue EI-CoreBioinformatics#174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh.

* EI-CoreBioinformatics#174: this should provide a solution to the issue, which is however only temporary. To be tested.

* EI-CoreBioinformatics#174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better.

* EI-CoreBioinformatics#174: peppered the failing block with try-except statements.

* EI-CoreBioinformatics#174: this should solve it. Now missing external scores in the database will cause Mikado to explicitly fail.

* Fixed EI-CoreBioinformatics#176

* BROKEN. Progress on EI-CoreBioinformatics#142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.**

* Closing EI-CoreBioinformatics#155.

* EI-CoreBioinformatics#174: Now Mikado pick will die informatively if the SQLite3 database has not been found.

* EI-CoreBioinformatics#166: fixed some issues with self-compare

* BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress.

* The padding now should be tested and correct.

* Fixed previous commit. This should fix EI-CoreBioinformatics#142.

* Development (EI-CoreBioinformatics#178)

* Update Singularity.centos.def

Changed python to python3 during %post, otherwise it will use the system python2.7...

* Fixed small bug in external metrics handling

* Update Singularity.centos.def

* This should address EI-CoreBioinformatics#173 (both configuration file and docs) and EI-CoreBioinformatics#158

* Fix EI-CoreBioinformatics#181 and small bug fix for parsing Mikado annotations.

* Progress for EI-CoreBioinformatics#142 - this should fix the wrong ORF calculation for cases when the CDS was open at the 5' end.

* Fixed previous commit (always for EI-CoreBioinformatics#142)

* EI-CoreBioinformatics#142: corrected and tested the issue with one-off exons, for padding.

* This should fix and test EI-CoreBioinformatics#142 for good.

* Removed spurious warning/error messages

* EI-CoreBioinformatics#142: solved a bug which caused truncated transcripts at the 5' end not to be padded.

* EI-CoreBioinformatics#142: solved a problem which caused a false abort for transcripts on the - strand with changed stop codon.

* EI-CoreBioinformatics#142: fixing previous commit

* Pushing the fix for EI-CoreBioinformatics#182 onto the development branch

* Fix EI-CoreBioinformatics#183

* Fix EI-CoreBioinformatics#183 and previous commit

* EI-CoreBioinformatics#183: now Mikado configure will set a seed when generating the configuration file. The seed will be explicitly mentioned in the log.

* EI-CoreBioinformatics#177: made ORF loading slightly faster with pysam. Also made XML serialisation much faster using SQL sessions and multiprocessing.Pool instead of queues.

* Solved annoying bug that caused Mikado to crash with TAIR GFF3s.

* Development (EI-CoreBioinformatics#184)
