Smarter transcript padding #142

lucventurini · 2018-11-05T12:20:26Z

At the moment Mikado performs a relatively stupid expansion: it checks whether the end of the other transcript is downstream of the one we are observing and, if it is, it just expands the transcript until that end. However, this means crossing all the potential introns downstream, and therefore results in a transcript with one or more retained introns (by definition).

There are four broad cases that I can think of, and we are currently treating all of them in the same way.
Notation: “A” is the transcript to be expanded, “B” the template. For simplicity, they are both on the “+” strand, and we are expanding their 3’ end. The reasoning is analogous for any of the other three potential orientations (“+” on the 5’ end, etc.).

1. Exon vs. exon. The last exon of transcript A ends within the last exon of transcript B.
   - Current algorithm: expand until end of B.
   - Proposed change: None.
2. Terminal exon vs. terminal intron. The last exon of transcript A ends within the last intron of transcript B.
   - Current algorithm: expand until end of B.
   - Proposed change: None.
3. Terminal exon vs. non-terminal exon. The last exon of transcript A ends within an internal exon of B.
   - Current algorithm: expand the last exon until end of B.
   - Proposed change: expand the last exon of A until its end is the same as the one of the overlapping exon. Add all the remaining exons of B.
4. Terminal exon vs. non-terminal intron. The last exon of transcript A ends within an internal intron of B.
   - Current algorithm: expand the last exon until end of B.
   - Proposed change: expand the last exon of A until the end of the exon following the intron of B. Then, add the remaining exons to A.
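
To make the four cases concrete, here is a minimal sketch of the proposed behaviour for the "+" strand, 3' end. It is not Mikado's code: the function name `pad_three_prime` and the representation of a transcript as a plain list of `(start, end)` exon tuples are assumptions made only for this illustration.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; hypothetical representation


def pad_three_prime(a_exons: List[Exon], b_exons: List[Exon]) -> Tuple[str, List[Exon]]:
    """Pad the 3' end of A ("+" strand) on template B; return (case, padded exons)."""
    a_exons, b_exons = sorted(a_exons), sorted(b_exons)
    a_start, a_end = a_exons[-1]
    if b_exons[-1][1] <= a_end:
        return "no padding needed", a_exons

    # First exon of B ending at or after the end of A. If A ends inside an
    # intron of B, this is the exon that follows that intron.
    idx = next(i for i, (_, end) in enumerate(b_exons) if end >= a_end)
    where = "exon" if b_exons[idx][0] <= a_end else "intron"
    terminal = idx == len(b_exons) - 1

    if terminal:
        # No proposed change: stretch the last exon of A until the end of B.
        padded = a_exons[:-1] + [(a_start, b_exons[-1][1])]
    else:
        # Proposed change: stop at the end of the overlapping (or following)
        # exon of B, then append the remaining exons of B.
        padded = a_exons[:-1] + [(a_start, b_exons[idx][1])] + b_exons[idx + 1:]
    case = f"terminal exon vs. {'terminal' if terminal else 'non-terminal'} {where}"
    return case, padded


template = [(100, 200), (300, 400), (500, 600)]
print(pad_three_prime([(100, 200), (300, 350)], template))
print(pad_three_prime([(100, 250)], template))
```

The sketch only covers the proposal as written in this comment; the handling of ends that fall within an intron is revisited later in the thread.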

lucventurini · 2018-11-05T16:14:55Z

The functionality is there; however, I should add some tests to make sure that we have covered at least all the basic cases.

lucventurini · 2018-11-06T18:47:14Z

Mikado now uses the internal interval tree of exon/intron segments to find all the overlapping segments. This should ensure that we can deal with all cases.
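
As an illustration of the kind of query involved, the sketch below builds the exon/intron segments of a template and returns those overlapping a given interval. It deliberately uses a plain list and a linear scan as a stand-in for the interval tree, so it shows the semantics of the lookup, not Mikado's data structure; `build_segments` and `find_overlapping` are hypothetical names.

```python
from typing import List, Tuple

Segment = Tuple[int, int, str]  # (start, end, "exon" | "intron"), 1-based inclusive


def build_segments(exons: List[Tuple[int, int]]) -> List[Segment]:
    """Interleave the exons of a transcript with the introns between them."""
    exons = sorted(exons)
    segments: List[Segment] = []
    for index, (start, end) in enumerate(exons):
        segments.append((start, end, "exon"))
        if index + 1 < len(exons):
            # The intron spans the gap up to the start of the next exon.
            segments.append((end + 1, exons[index + 1][0] - 1, "intron"))
    return segments


def find_overlapping(segments: List[Segment], start: int, end: int) -> List[Segment]:
    """Return every segment sharing at least one base with [start, end]."""
    return [seg for seg in segments if seg[0] <= end and seg[1] >= start]


segments = build_segments([(100, 200), (300, 400), (500, 600)])
print(find_overlapping(segments, 250, 350))
```

A real interval tree answers the same question without scanning every segment, which matters for loci with many transcripts.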

lucventurini · 2018-11-07T13:58:01Z

The procedure is now as follows:

1. After adding all the putative transcript events, check which ones have a score over the threshold, and keep only the first N transcripts that pass this requirement (N=max_isoforms).
2. If we had to discard transcripts, recalculate metrics and scores, and recheck.
3. Check for retained introns. Depending on the settings, any transcript with retained introns will either be flagged or removed.
4. If they are removed, recalculate metrics and scores, and restart from 1.
5. If we have to execute the padding:
   a. Create a copy of the transcripts in the locus.
   b. Calculate the padded transcripts, keeping track of those we used as templates for the padding.
   c. Recalculate metrics and scores.
   d. Check if we have created non-valid transcripts by padding:
      - If the invalid transcripts have not been used as templates, discard them and continue the procedure.
      - If they have, or we have made the primary transcript invalid: discard the invalid templates and restart.
   e. Using the score to derive the insertion order, check whether the modified transcripts are still valid ASEs.
   f. Check whether any of the retained transcripts now has a retained intron.
   g. If step e or f finds invalid transcripts, repeat point d, ie:
      - If the invalid transcripts have not been used as templates, discard them and continue the procedure.
      - If they have, or we have made the primary transcript invalid: discard the invalid templates and restart.

This involved procedure has multiple fail-checks and should ensure that:

- the primary transcript will not be made invalid;
- no transcript will stay as ASE if it becomes an invalid ASE after padding;
- no transcript will be padded according to the structure of a transcript we ended up discarding.
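
The discard-and-restart part of the padding step can be sketched roughly as below. This is a toy model, not the actual implementation: `pad_with_restarts`, `toy_pad` and `toy_is_valid` are hypothetical names, transcripts are reduced to dictionaries with an `end` coordinate, and I am assuming that "discard the invalid templates" for an invalidated primary transcript means dropping the template it was padded with. Score thresholds, metric recalculation and the retained intron checks are left out.

```python
from typing import Callable, Dict, Tuple

Locus = Dict[str, dict]


def pad_with_restarts(
    transcripts: Locus,
    primary: str,
    pad: Callable[[Locus], Tuple[Locus, Dict[str, str]]],
    is_valid: Callable[[dict], bool],
) -> Locus:
    """Pad a locus, restarting whenever a template has to be thrown away."""
    current = dict(transcripts)
    while True:
        # Always pad a copy, so that a restart begins from unpadded models.
        padded, template_of = pad(dict(current))
        invalid = {tid for tid, tr in padded.items() if not is_valid(tr)}
        # Invalid transcripts that were used as templates must go.
        to_discard = invalid & set(template_of.values())
        if primary in invalid and primary in template_of:
            # Assumption: blame the template that broke the primary transcript.
            to_discard.add(template_of[primary])
        to_discard = (to_discard - {primary}) & set(current)
        if not to_discard:
            # Invalid transcripts were not templates: drop them and carry on.
            return {tid: tr for tid, tr in padded.items() if tid not in invalid}
        for tid in to_discard:
            del current[tid]


def toy_pad(locus: Locus) -> Tuple[Locus, Dict[str, str]]:
    """Toy padding: stretch every transcript to the farthest end in the locus."""
    template = max(locus, key=lambda tid: locus[tid]["end"])
    padded, template_of = {}, {}
    for tid, tr in locus.items():
        gap = locus[template]["end"] - tr["end"]
        padded[tid] = {"end": tr["end"] + gap, "padded_by": gap}
        if gap > 0:
            template_of[tid] = template
    return padded, template_of


def toy_is_valid(transcript: dict) -> bool:
    """Toy validity rule: a transcript may not grow by more than 150 bases."""
    return transcript["padded_by"] <= 150


locus = {"t1": {"end": 100}, "t2": {"end": 500}, "t3": {"end": 200}}
result = pad_with_restarts(locus, "t1", toy_pad, toy_is_valid)
print(result)
```

In the toy run, padding on `t2` makes the primary `t1` invalid, so `t2` is discarded and the procedure restarts with `t3` as the template.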

lucventurini · 2019-02-21T14:02:11Z

Solved after confirmation by @swarbred

lucventurini · 2019-05-17T13:05:38Z

As noted by @gemygk and @swarbred:

"ts_max_distance" should refer to the cDNA distance, not the genomic distance.
"Reference" transcripts are not considered in any particular way for the AS machinery. That means that they will always pass the requirements checks, but they might not be passing the requirements for ASEs. This should be made clear in the documentation.
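
To illustrate the difference between the two distances, here is a small sketch for the "+" strand, 3' end. The function name `padding_distances` and the exon tuples are assumptions for this example only: the genomic distance counts every base up to the end of the template, while the cDNA distance counts only the exonic bases that would be added.

```python
from typing import List, Tuple


def padding_distances(a_end: int, template_exons: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (genomic, cDNA) distance from a_end to the 3' end of the template."""
    template_exons = sorted(template_exons)
    genomic = max(0, template_exons[-1][1] - a_end)
    # Only count template bases that lie downstream of a_end and within exons.
    cdna = sum(
        end - max(start, a_end + 1) + 1
        for start, end in template_exons
        if end > a_end
    )
    return genomic, cdna


template = [(100, 200), (300, 400), (500, 600)]
print(padding_distances(350, template))  # genomic 250 bp, cDNA 151 bp
```

With a long intron in between, the genomic distance can exceed a threshold that the cDNA distance comfortably meets.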

lucventurini · 2019-05-17T13:25:09Z

As an addendum, transcripts should not be expanded if the boundary of the expandable transcript ends within an intron. In these cases both expansion options (ie creating a false intron or creating a massive exon) are undesirable. So we should disable this.

lucventurini · 2019-05-20T10:38:41Z

@swarbred @gemygk

Refining the padding: a complex case

I am revising the algorithm for the padding. I have already added the part that will make Mikado aware of where a transcript ends (see 0b64818). The problem is that there are ambiguous cases that need to be handled in a deterministic manner. Specifically:

t1:  |===|-----|====|--|====|----|====|
t2:  |===|-------------|====|----|=====|--------|==|
t3:  |===|-------------|====|----|=========|---------|====|
t4:  |===|-------------|====|----|=======|
t5:  |===|-------------|====|----|==|--------|===|
t6:  |===|--------|=======|----------------|====|

In this case:

T1 is the only expandable transcript: all the others are mutually incompatible
T1 is not compatible with T6 (it would mean extending an exon which is completely internal to an intron of the template)
T1 is fully compatible with T2, T3, T4
T1 might be compatible with T5. This depends on whether we accept adding an intron retention event (the last exon of T1 starts within the second-to-last exon of T5 and terminates within the last intron). How do we feel about this?
T2, T3, T4, T5 and T6 are all mutually incompatible in terms of expansion. T1 can be expanded according to the template of one and only one of the other transcripts.

Shifting to directional graphs

The way to break the conundrum:

store the relationship between the paddable transcripts in a directional graph.
store not only the direction (e.g. T1 could be expanded to T2) but also the distance that would need to be filled. This should take into account both the number of introns and the genomic distance.

So in our example the best choices would probably be, in order:

T4 (long exon elongation, no additional splicing)
T2 (short exon elongation, additional splicing event)
T3 (long exon elongation, additional splicing event)
T5 (?; long exon/intron extension).

The final algorithm should therefore:

link together T1 to all the valid alternatives
recognise T2, T3, T4 (,T5?) as multiple and incompatible "end points" of the path
prioritise each of the links according to the distance metric
choose one of the extensions, discard the rest
potentially we might want to backtrack if the extension becomes invalid. This however would further complicate the algorithm and require more development time.
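
A minimal sketch of the idea, using a plain adjacency mapping as the directed graph. Everything here is illustrative: `PadEdge`, `rank_templates` and the edge values are made up (they are not measured from the diagram above), and the ordering key (intron retention first, then added splices, then genomic distance) is only one assumption of how the distance could be defined.

```python
from typing import Dict, List, NamedTuple


class PadEdge(NamedTuple):
    retains_intron: bool   # would the extension run across an intron of the template?
    new_splices: int       # introns that the extension would add
    genomic_distance: int  # bases between the current end and the new end


# Directed graph: graph[source][template] = cost of padding source on template.
# The numbers are hypothetical placeholders.
graph: Dict[str, Dict[str, PadEdge]] = {
    "T1": {
        "T2": PadEdge(False, 1, 120),
        "T3": PadEdge(False, 1, 180),
        "T4": PadEdge(False, 0, 60),
        "T5": PadEdge(True, 1, 150),
    }
}


def rank_templates(graph: Dict[str, Dict[str, PadEdge]], source: str) -> List[str]:
    """Order the templates reachable from source, cheapest extension first."""
    edges = graph.get(source, {})
    # The template name is the last sort key, so ties are broken deterministically.
    return sorted(edges, key=lambda template: (edges[template], template))


ranking = rank_templates(graph, "T1")
print(ranking)  # ['T4', 'T2', 'T3', 'T5']
```

Only the first entry of the ranking would be used for the extension; keeping the rest around is what would make backtracking possible.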

swarbred · 2019-05-23T15:35:42Z

@lucventurini the alternative would be that, where you have multiple compatible transcripts which could be used for extension, you first check and eliminate the options that would not meet the ts_max_distance and ts_max_splices requirements, and then take the highest scoring transcript among the remaining ones. If there is a tie, I would be fine with any way of splitting this.

If I was manually annotating your example I would merge t1 into the "best" of the alternative compatible models, i.e. the one which gave the longest CDS or had the most support from evidence. Merging into the highest scoring compatible transcript would probably most closely reproduce my choice.
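
In code, this suggestion might look like the sketch below. The `Candidate` tuple, the function `choose_template` and the example values are hypothetical; only the parameter names `ts_max_distance` and `ts_max_splices` come from the discussion. Ties are broken here by transcript ID purely to keep the result deterministic.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    tid: str
    score: float
    distance: int  # bases that the padding would add
    splices: int   # splice junctions that the padding would add


def choose_template(
    candidates: List[Candidate], ts_max_distance: int, ts_max_splices: int
) -> Optional[Candidate]:
    """Drop templates exceeding the limits, then take the highest scoring one."""
    eligible = [
        c for c in candidates
        if c.distance <= ts_max_distance and c.splices <= ts_max_splices
    ]
    if not eligible:
        return None
    # Highest score first; on a tie, the smallest ID wins (arbitrary but stable).
    return min(eligible, key=lambda c: (-c.score, c.tid))


candidates = [
    Candidate("t2", score=18.0, distance=250, splices=1),
    Candidate("t3", score=21.5, distance=900, splices=1),
    Candidate("t4", score=18.0, distance=120, splices=0),
]
best = choose_template(candidates, ts_max_distance=500, ts_max_splices=1)
print(best.tid if best else None)
```

Here `t3` has the best score but is filtered out by the distance limit, and the tie between `t2` and `t4` is resolved by ID.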

How are we currently dealing with this? What you are suggesting sounds like a substantial change.

lucventurini · 2019-05-23T16:51:47Z

Hi @swarbred,
currently we are dealing with this in a suboptimal way, which basically ends up making a random choice. Moreover, as I was storing only the connection between two transcripts (so t1 <=> t4) and not the direction (e.g. t1 => t4), I ended up with a hodgepodge. This was fine when the relationship was very linear (ie only expanding based on genomic coordinates), but it is inefficient and breaks when shifting to the more sophisticated version of padding we are trying to implement.

Your suggestion of using the score of the transcript as our metric is extremely sensible, though; I will implement it as soon as I can.

lucventurini · 2019-06-03T16:54:02Z

Hi @swarbred, @gemygk, after e1b204d the padding should now be fixed.
As written above, ties will now be decided by the scoring.
Although I have tried to test it properly within the test suite, the best way will be to try it out on real data.

@cschuh

* Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests.
* Fixed previous commit
* Fixed travis bug
* Refactoring of check_index for Mikado compare (#166) and fix for #172
* Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover
* This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (#137) potentially also fixing #172.
* Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs
* Fixed previous breakage
* Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well.
* Minor edit to assigner
* Fixing previously broken commit
* Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless.
* Adding the GZI index to the tests directory to avoid permission errors. Addressing #175
* Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output)
* Adding a maximum intron length for the default scoring configuration files.
* BROKEN. Proceeding on #142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way.
* Issue #174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh.
* #174: this should provide a solution to the issue, which is however only temporary. To be tested.
* #174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better.
* #174: peppered the failing block with try-except statements.
* #174: this should solve it. Now missing external scores in the database will cause Mikado to explicitly fail.
* Fixed #176
* BROKEN. Progress on #142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.**
* Closing #155.
* #174: Now Mikado pick will die informatively if the SQLite3 database has not been found.
* #166: fixed some issues with self-compare
* BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress.
* The padding now should be tested and correct.
* Fixed previous commit. This should fix #142.

lucventurini · 2019-06-06T18:09:25Z

Currently the CDS padding is broken. To be fixed ASAP.

* This should address #173 (both configuration file and docs) and #158
* Fix #181 and small bug fix for parsing Mikado annotations.
* Progress for #142 - this should fix the wrong ORF calculation for cases when the CDS was open at the 5' end.
* Fixed previous commit (always for #142)
* #142: corrected and tested the issue with one-off exons, for padding.
* This should fix and test #142 for good.
* Removed spurious warning/error messages
* #142: solved a bug which caused truncated transcripts at the 5' end not to be padded.
* #142: solved a problem which caused a false abort for transcripts on the - strand with changed stop codon.
* #142: fixing previous commit
* Pushing the fix for #182 onto the development branch
* Fix #183
* Fix #183 and previous commit
* #183: now Mikado configure will set a seed when generating the configuration file. The seed will be explicitly mentioned in the log.
* #177: made ORF loading slightly faster with pysam. Also made XML serialisation much faster using SQL sessions and multiprocessing.Pool instead of queues.
* Solved annoying bug that caused Mikado to crash with TAIR GFF3s.

@cschuh

* Solved a small bug in the Gene class
* This commit should fix some of the performance issues found in Mikado compare when testing in the all vs all (issue #166).
* Updated the CHANGELOG.
* Slight improvements to the generic GFLine class and to the to_gff wrapper
* Solved some assorted bugs, from stop_codon parsing in GTF2 (for Augustus) to avoiding a very costly pragma check on MIDX databases.
* Now Mikado util stats will only return one value for the mode, making the table parsable
* Solved some small bugs introduced by changing the mode for mikado util stats
* Dropping automated support for Python3.5. The conda environment cannot be created successfully, too many packages have not been updated in the original repositories.
* Updating the conda environment to reflect that only Python>=3.6 is now accepted
* Various fixes for managing correctly BED12 files.
* Fix for the previous commit breaking TRAVIS
* Development (#178)
* Update Singularity.centos.def Changed python to python3 during %post, otherwise it will use the system python2.7...
* Fixed small bug in external metrics handling
* Update Singularity.centos.def
* Development (#184)

lucventurini · 2019-07-01T11:24:16Z

Hi @gemygk, @swarbred, @cschu, am I correct in saying that you have not found any new errors in the latest runs?
If that is the case, we might close this issue.

lucventurini · 2019-07-03T10:39:32Z

Fixed as the current status.

…king for the interface with introns, not just exons

…on (EI-CoreBioinformatics#136)

…well (EI-CoreBioinformatics#137)

…ormatics#142 should be solved.

…matics#142, implemented new unit-test for EI-CoreBioinformatics#137

…informatics#137

* EI-CoreBioinformatics#142: fixing previous commit * Pushing the fix for EI-CoreBioinformatics#182 onto the development branch * Fix EI-CoreBioinformatics#183 * Fix EI-CoreBioinformatics#183 and previous commit * EI-CoreBioinformatics#183: now Mikado configure will set a seed when generating the configuration file. The seed will be explicitly mentioned in the log. * EI-CoreBioinformatics#177: made ORF loading slightly faster with pysam. Also made XML serialisation much faster using SQL sessions and multiprocessing.Pool instead of queues. * Solved annoying bug that caused Mikado to crash with TAIR GFF3s. * Development (EI-CoreBioinformatics#184) * This should address EI-CoreBioinformatics#173 (both configuration file and docs) and EI-CoreBioinformatics#158 * Fix EI-CoreBioinformatics#181 and small bug fix for parsing Mikado annotations. * Progress for EI-CoreBioinformatics#142 - this should fix the wrong ORF calculation for cases when the CDS was open at the 5' end. * Fixed previous commit (always for EI-CoreBioinformatics#142) * EI-CoreBioinformatics#142: corrected and tested the issue with one-off exons, for padding. * This should fix and test EI-CoreBioinformatics#142 for good. * Removed spurious warning/error messages * EI-CoreBioinformatics#142: solved a bug which caused truncated transcripts at the 5' end not to be padded. * EI-CoreBioinformatics#142: solved a problem which caused a false abort for transcripts on the - strand with changed stop codon. * EI-CoreBioinformatics#142: fixing previous commit * Pushing the fix for EI-CoreBioinformatics#182 onto the development branch * Fix EI-CoreBioinformatics#183 * Fix EI-CoreBioinformatics#183 and previous commit * EI-CoreBioinformatics#183: now Mikado configure will set a seed when generating the configuration file. The seed will be explicitly mentioned in the log. * EI-CoreBioinformatics#177: made ORF loading slightly faster with pysam. 
Also made XML serialisation much faster using SQL sessions and multiprocessing.Pool instead of queues. * Solved annoying bug that caused Mikado to crash with TAIR GFF3s.

lucventurini added enhancement question labels Nov 5, 2018

lucventurini added this to the 1.3 milestone Nov 5, 2018

lucventurini assigned swarbred, gemygk and lucventurini Nov 5, 2018

lucventurini added a commit that referenced this issue Nov 9, 2018

Fixed hopefully some edge cases for #142

c2252d2

lucventurini added a commit that referenced this issue Jan 28, 2019

Made transcript padding the default action (#142)

e2d49b1

lucventurini added a commit that referenced this issue Jan 28, 2019

Improved a bit calls, as per #142 and #137

9b0a4eb

lucventurini closed this as completed Feb 21, 2019

lucventurini reopened this May 17, 2019

lucventurini closed this as completed in 3b32a01 Jun 18, 2019

lucventurini reopened this Jun 19, 2019

lucventurini closed this as completed Jul 3, 2019

lucventurini added this to Closed in Version 2 Oct 15, 2020

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Solved EI-CoreBioinformatics#142

51c0fd5

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Solved EI-CoreBioinformatics#142 properly - now we are correctly checking for the interface with introns, not just exons

7caa516

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Fixed and tested EI-CoreBioinformatics#142

0f1062a

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Solved (hopefully) EI-CoreBioinformatics#142, updated the documentation (EI-CoreBioinformatics#136)

a774758

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Fixed the padding (EI-CoreBioinformatics#142) with some new tests as well (EI-CoreBioinformatics#137)

111e1cc

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Unit-tests (EI-CoreBioinformatics#137) seem to say that EI-CoreBioinformatics#142 should be solved.

8f18b3c

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Fixed a bug introduced by previous commit, always for EI-CoreBioinformatics#142, implemented new unit-test for EI-CoreBioinformatics#137

8983973

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Solved yet another bug in EI-CoreBioinformatics#142

199ae45

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Fixed hopefully some edge cases for EI-CoreBioinformatics#142

123ac2f

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Another fix for EI-CoreBioinformatics#142

92ada58

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Made transcript padding the default action (EI-CoreBioinformatics#142)

05123f7

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Improved a bit calls, as per EI-CoreBioinformatics#142 and EI-CoreBioinformatics#137

cbb419f

Comments

lucventurini commented Nov 5, 2018

lucventurini commented Nov 5, 2018

lucventurini commented Nov 6, 2018

lucventurini commented Nov 7, 2018

lucventurini commented Feb 21, 2019

lucventurini commented May 17, 2019

lucventurini commented May 17, 2019

lucventurini commented May 20, 2019 • edited

Refining the padding: a complex case

Shifting to directional graphs

swarbred commented May 23, 2019

lucventurini commented May 23, 2019

lucventurini commented Jun 3, 2019

lucventurini commented Jun 6, 2019

lucventurini commented Jul 1, 2019

lucventurini commented Jul 3, 2019

lucventurini commented May 20, 2019 • edited