Add prepare option to remove transcripts with the same intron chain #270

swarbred · 2020-02-15T10:12:29Z

@lucventurini for possible future consideration (post 2.0)

With us now starting to use CSS reads or raw pacbio / nanopore alignments in mikado the number of reads is getting much larger.

In the context of the transcriptome module I've discussed with @ljyanesm some pragmatic solutions (e.g. stringtie2 to fillter and collapse the data, downside being this assembles reads OR collapsing with GFFread pre or post mikado).

Obviously this can be done outside of mikado but I wanted to check if it was easily feasible to do it as part of the prepare step.

The idea would be to remove transcript with identical intron chains (keeping I guess the version with the most extended terminal exons, e.g. as gffread -M), but NOT removing contained fragments of partial intron chains i.e. as gffread -M -K

That would not be advisable for illumina RNA-Seq assembled transcripts but it's less dangerous for long reads and would substantially reduce the number of transcripts for orf / blast analysis.

Ultimately there are better solutions to filter long reads prior to mikado but the above would be a useful feature (if you already hold the info required during prepare).

Note this would need to be tied to specific labels to allow this to be applied to just certain transcript sets.

lucventurini · 2020-02-15T10:25:03Z

Hi @swarbred ,
this should not be too hard to accomplish, actually. It would probably require only some fiddling around the store_transcripts function.
It is a good idea, I think. I can definitely start experimenting on this after 2.0.

…d on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.

lucventurini · 2020-03-21T12:21:24Z

@swarbred @ljyanesm

872dcff has the required changes. As stated in the commit message, now Mikado will remove redundant transcripts based on their intron chains / monoexonic overlap, rather than checking simply for the start/end. Exceptions; two transcripts with similar coordinates but different strand or CDS content.
Needs to be properly tested, but I think that 872dcff properly implements the required changes.

lucventurini · 2020-03-21T15:21:34Z

Note this would need to be tied to specific labels to allow this to be applied to just certain transcript sets.

@swarbred apologies, the changes in 872dcff do not distinguish between datasets. Still WIP

lucventurini · 2020-03-24T15:51:06Z

@swarbred @ljyanesm

The current commit (c563f37) has a functioning prototype of this function.
Currently missing: being able to specify which samples should see themselves pared down (and/or how). We should discuss this tomorrow in the meeting.

@swarbred

…eatures asked by @swarbred (tagging also @ljyanesm)

lucventurini · 2020-03-25T12:34:32Z

Hi @swarbred

Commit bc0774e should add all the features that you request.
Specifically now the list of files will have the following format:

filename label strandedness score is_reference keep_redundant

The first three fields are mandatory.
Score can be omitted (default 0).
is_reference must be boolean. Default: false.
keep_redundant must be boolean. A reference transcript will have this set to True by default.

A transcript will be considered as redundant to a template if:

it is completely contained within the template
its intron chain is identical to the template (for multiexonic transcripts)
its CDS (if present for both) is identical to the template.
its strand is identical to the template.

Please note that the last point implies that some transcripts will slip through: the check on the strandedness for multiexonic transcripts happens downstream to the redundancy test (as it is a quite expensive operation).

@swarbred

* For EI-CoreBioinformatics#270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies. * For EI-CoreBioinformatics#270: this commit should introduce all the features asked by @swarbred (tagging also @ljyanesm)

* Issue #280: now mikado serialise has been refactored so that: * loading of BLAST XMLs should be faster thanks to using Cython on the most time-expensive function * mikado now accepts also *tabular* BLAST data (custom format, we need the `ppos` and `btop` extra fields) * `daijin` now automatically generates *tabular* rather than XML BLAST results * `mikado` will now use `spawn` as the default multiprocessing method. This avoids memory accounting problems in eg. SLURM (sometimes `fork` results in the HPC management system to think that the shared memory is duplicated, massively and falsely inflating the accounting of memory usage). * Issue #270: now Mikado will remove redundancy based on intron chains * For #270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.

lucventurini · 2020-04-07T08:12:25Z

Fixed in b3f3d7a

swarbred · 2020-04-07T20:55:50Z

@lucventurini

the mikado configure --help

needs to be updated with the additional column for keep_redundant

--list LIST           Tab-delimited file containing rows with the following format
                                                <file>  <label> <strandedness> <score(optional)> <always_keep(optional)>

lucventurini · 2020-04-07T21:22:45Z

True. I thought I had done already, will do so tomorrow.

swarbred · 2020-04-07T21:25:33Z

This was from the --help from 4ab8cf5

swarbred · 2020-04-15T08:25:08Z

Hi @lucventurini

Sorry I'm slightly confused, when I read this originally I thought you had added an extra column to the list file "keep_redundant" that was a boolean for whether to collapse transcripts contained in another model with the default being TRUE i.e. only collapsing exact matching transcripts. Looking at the the prepare --help description for list this indicates

<score(optional)> <is_reference(optional)><always_keep(optional)

the help from an earlier version before this change

<score(optional)> <always_keep(optional)>

so the new column is "is_reference" as we had always_keep in the earlier version

What you called "keep_redundant" looks to map to <always_keep(optional)>

if so it seems we have switched the meaning of always_keep in the earlier version to now be is_reference.

is_reference = flag which models are to be regarded as reference transcripts (this seems clear from the name, and I prefer using this as a column name than always_keep)

always_keep = I guess if you take always_keep to be short for "always keep contained" then it makes sense what confused me is that we seem to have switched always_keep to be is_reference. I understand what matters is the column order (and the marking of reference transcripts is still col 5 across versions).

As long as the doc makes clear the reason for the different columns and that the collapsing of contained models is separate from the marking of reference transcripts with the default NOT to collapse contained models then this seems fine.

lucventurini · 2020-04-15T12:10:07Z

@swarbred

You are right that the documentation is confusing. This will be the main target for modifications before releasing 2.0 ...

But onto how it should work:

Sorry I'm slightly confused, when I read this originally I thought you had added an extra column to the list file "keep_redundant" that was a boolean for whether to collapse transcripts contained in another model with the default being TRUE i.e. only collapsing exact matching transcripts. Looking at the the prepare --help description for list this indicates

Slightly incorrect. The choice is between either not collapsing (keep_redundant: True) or collapsing matching, contained intron chains (keep_redundant: False). The default in the RC is False ie remove redundancies. That's why the command line switch now is --keep-redundant.

What you called "keep_redundant" looks to map to <always_keep(optional)>
if so it seems we have switched the meaning of always_keep in the earlier version to now be is_reference.
is_reference = flag which models are to be regarded as reference transcripts (this seems clear from the name, and I prefer using this as a column name than always_keep)

Yes, that is correct. Internally I still use the "reference" moniker; I will amend accordingly.

Moreover, I just realised that unfortunately the parsing of the input list of files (either in configure or prepare was broken).

As long as the doc makes clear the reason for the different columns and that the collapsing of contained models is separate from the marking of reference transcripts with the default NOT to collapse contained models then this seems fine.

Yes, we will amend the documentation. I have also changed things so that the default now is to not reduce redundancy unless specified.

Please see commits e57aa8a and f6ea9a1.

lucventurini · 2020-04-15T13:50:24Z

@swarbred

My reference to "TRUE i.e. only collapsing exact matching transcripts" was that we would not collapse the contained intron chains but just apply the normal removing of exact matching transcripts.

To clarify: the option of removing exactly matching transcripts is not present. Either all transcripts are left in or we remove redundant intron chains. My fault, I understood that we wanted to supersede the behaviour, rather than having three possible options instead of two (keep everything, remove identical transcripts, remove redundant intron chains).

I don't think we should be as default collapsing based on intron chain given that we have not really tested this and there is the potential for this to have a negative effect (i.e. single exon genes adjacent to multi exonic genes being merged together in an assembled transcript).

This will not happen in any circumstance. Mikado removes transcripts that have the same intron chain. So if a transcript with n - 1 introns is completely contained within a transcript that has n introns, both will be kept.

So in a run with mikado-2.0pcr1 where I have not populated this 6th column in the list file i was running with a default of False i.e. I was collapsing all transcripts that were contained?

Yes, I fear so. Unfortunately, and very embarassingly for me, I realised this morning that the parsing of the list file was broken.

Seems safest at least until we start to make use of this and get a better idea about any downsides. So what does this mean for your command line option --keep-redundant if now the default is already to keep redundant transcripts. Also this only refers to the contained intron chains we are still removing transcripts where the coordinates exactly match.

The --keep-redundant switch would supersede anything in the configuration file. So if this is invoked, all files will have all transcripts kept in, regardless of what was written in the list of files provided to configure.

lucventurini · 2020-04-15T14:09:34Z

Hopefully this will help clear things up:

In this locus:

A' is redundant with A and will be removed.
B is not redundant with A, or viceversa, because it is not fully contained.
C is not redundant with A, A' or B, because it has one intron less than any of these three.
C' is redundant with C and will be removed (for the same reasons as A' is redundant with A)
D is not redundant with any multiexonic transcript.
D' is redundant with D, because it is fully contained.
D and E are not redundant because neither is fully contained in the other

I hope this makes the logic a little bit clearer?

swarbred · 2020-04-15T16:38:38Z

My fault, I understood that we wanted to supersede the behaviour, rather than having three possible options instead of two (keep everything, remove identical transcripts, remove redundant intron chains).

This is what I was envisaging

no changes to is_reference i.e. any model marked as reference is retained (strictly we may have some exceptions i.e. where models fail some criteria i.e. CDS with internal stops, whatever we are currently doing seems fine).

Outside of reference models I cant see a reason why we would ever want to keep truly identical transcripts (there is no benefit in mikado pick if the models are really identical as far as I can see) so I wasn't thinking we would change this behaviour (i.e. I expected these always to be removed unless the model is a reference) and the user wouldn't need any control of this.

I was thinking we would just add a column to the list file to enable the control over removing redundant intron chains.

so the default behaviour would be as in previous versions i.e. we only remove identical models unless marked as reference but we would have the option to be more aggressive and remove redundant intron chains (again excluding models marked as reference).

This will not happen in any circumstance. Mikado removes transcripts that have the same intron chain. So if a transcript with n - 1 introns is completely contained within a transcript that has n introns, both will be kept.

My example would be like this, two adjacent genes, 1 single exon 1 multiexonic as shown in the REF annotation (i.e. what is the correct annotation). But the RNA assemblies generated look like the three transcript assemblies (in transcript 2, there is low coverage between the two genes which results in a single fused transcript). Transcript 2 and 3 are the same relationship as your A and A` example

REF annotation XXXXXX XXX---XXX---XXXX
Transcript 1 XXXXX
Transcript 2 XXXXXXXXXXX---XXX---XXXX
Transcript 3 XXX---XXX---XXXX

In this case I believe if you run with the new collapse intron chain logic we will end up with just two transcripts 1 and 2.

So we had the "correct" answer in the input but lose 1 of these correct transcripts in the prepare output if we collapse on intron chains. This would be one case where collapsing on intron chain would be detrimental (hence why I wouldn't do this as default for illumina assemblies).

The above is going to be less of an issue when you have full-length transcripts rather than assembled rna-seq.

swarbred · 2020-04-15T16:40:49Z

Sorry my attempt at the diagram gets messed up on posting, I will post the image

swarbred · 2020-04-15T16:41:26Z

lucventurini · 2020-04-15T16:52:39Z

So we had the "correct" answer in the input but lose 1 of these correct transcripts in the prepare output if we collapse on intron chains. This would be one case where collapsing on intron chain would be detrimental (hence why I wouldn't do this as default for illumina assemblies).

I see your point. I would hope that this specific instance would be caught by chimera splitting, but that's just quibbling - I can see the reason to keep the behaviour as keeping redundancies.

so the default behaviour would be as in previous versions i.e. we only remove identical models unless marked as reference but we would have the option to be more aggressive and remove redundant intron chains (again excluding models marked as reference).

Yes, this is achievable. I will reopen the issue-270 branch ...

…ates (#270)

lucventurini · 2020-04-16T11:20:18Z

@swarbred

b552dfc and 657b9c7 should address this problem. Now redundant transcripts (ie, non-reference identical copies, down to the base and including a comparison of the CDS) are always de-duplicated by keeping only one copy (giving preference to reference transcripts first, to higher-scoring sources second).

swarbred · 2020-04-16T16:15:12Z

@lucventurini
Thanks we have few hpc issues currently but i will run on the test region when I can

swarbred · 2020-09-25T15:42:02Z

@lucventurini

I'm just coming back to this after some time (we had hpc issues at the time of original email and the 2.0prc2 version i ran didn't complete the prepare stage)

It looks like the modified behaviour is default on in 2.0prc2

e.g. 487445 seqs after prepare with d094f99

275429 seqs after prepare with 2.0prc2

my intention was to have this as an option for the special cases when you need to more aggressively remove models with the same intron chain i.e. very large data sets thinking mainly of pacbio CCS or ONT reads not as a default for the more common case of normal illumina assemblies.

It certainly cuts down the model numbers so speeds down stream steps but I suspect it is detrimental for illumina assemblies. Whether it's default on or off I want to know how to switch to the original behaviour.

Did we make a change to NOT have the new behaviour as default after the version i'm running or is this still the default?

To clarify is this how it currently works

with --keep-redundant TRUE , we remove only fully redundant models (i.e. match end to end) unless they are marked as reference i.e. always keep (i.e. the old behaviour)

with --keep-redundant FALSE , we remove models with redundant intron chains unless they are marked as reference (i.e. the new behaviour)

From the configration.toml file that was generated the default appears to be keep_redundant = false which would fit with the above.

lucventurini · 2020-10-01T09:09:13Z

Hi @swarbred

Yes, your understanding is exactly right.

I can switch back the behaviour so that the flag can be "--exclude-redundant" if that is the preferred case. It will require a bit of fiddling as some automated tests rely on the redundancy removal.

Best

swarbred · 2020-10-01T16:37:57Z

Hi @lucventurini when you have access see JIRA ticket GENANNO-480 and my comment from 28th sept 12:06 and the browser link which is shown in the comment below that.

There will be some cases where the more aggressive removal results in an improved selection but I think there are more times where this is undesirable (specifically dealing with illumina assemblies) e.g. the screen shot below where the more aggressive redundancy removal results in only the over extended model being selected (this is not stranded data). The top track is with --keep-redundant FALSE the second track with = TRUE

So I would favour changing to --exclude-redundant

lucventurini · 2020-10-02T09:35:25Z

Hi @swarbred,

I am proceeding to change the interface as you requested. Hopefully I should be done by this afternoon.

As the only change will be the direction of the switch, not the behaviour itself, it should be possible to quickly test this and then merge into master?

…ant' has become 'exclude_redundant' and the default behaviour is, like in Mikado 1.0, to keep redundant models unless specified otherwise. Also changed the minimum biopython version (1.78) and made small adjustments as the new library version has removed support for 'Alphabet'.

lucventurini · 2020-10-05T17:12:21Z

Hi @swarbred @gemygk

Travis is testing the latest commit in the branch (https://travis-ci.org/github/lucventurini/mikado/jobs/733038131) which should function for Python 3.6, 3.7 and 3.8.

I will sign it off here and leaving this commit, 2a61631, for you to test.

Please let me know if there is anything else I can do on this.

swarbred · 2020-10-07T12:59:09Z

@lucventurini can confirm 2a61631 works as expected

lucventurini · 2020-10-07T13:13:04Z

@swarbred

Excellent! Merging into master and closing!

…ioinformatics#305) * Issue EI-CoreBioinformatics#280: now mikado serialise has been refactored so that: * loading of BLAST XMLs should be faster thanks to using Cython on the most time-expensive function * mikado now accepts also *tabular* BLAST data (custom format, we need the `ppos` and `btop` extra fields) * `daijin` now automatically generates *tabular* rather than XML BLAST results * `mikado` will now use `spawn` as the default multiprocessing method. This avoids memory accounting problems in eg. SLURM (sometimes `fork` results in the HPC management system to think that the shared memory is duplicated, massively and falsely inflating the accounting of memory usage). * Issue EI-CoreBioinformatics#270: now Mikado will remove redundancy based on intron chains * For EI-CoreBioinformatics#270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.

…I-CoreBioinformatics#291. Ready to roll into Conda etc.

…ates (EI-CoreBioinformatics#270)

swarbred added enhancement question labels Feb 15, 2020

lucventurini assigned lucventurini and ljyanesm Feb 15, 2020

lucventurini mentioned this issue Feb 15, 2020

Allow stripping of CDS features in pick #271

Closed

lucventurini mentioned this issue Mar 11, 2020

stuck at some point during mikado pick #295

Closed

lucventurini mentioned this issue Mar 20, 2020

Mikado serialize running time #280

Closed

lucventurini added a commit that referenced this issue Mar 21, 2020

For #270: now mikado prepare will remove redundant transcripts base…

872dcff

…d on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.

lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 24, 2020

Some bugfixing. Not finished yet (EI-CoreBioinformatics#270)

44f73be

lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 24, 2020

For EI-CoreBioinformatics#270: fixed a bug and unit tests

c563f37

lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 25, 2020

For EI-CoreBioinformatics#270: this commit should introduce all the f…

bc0774e

…eatures asked by @swarbred (tagging also @ljyanesm)

lucventurini mentioned this issue Mar 25, 2020

Which mikado version to use? #299

Closed

lucventurini closed this as completed Apr 7, 2020

lucventurini reopened this Apr 7, 2020

lucventurini closed this as completed in 8976f1b Apr 8, 2020

lucventurini added a commit that referenced this issue Apr 15, 2020

Fix the parsing of the list.txt file (for #270)

e57aa8a

lucventurini added a commit that referenced this issue Apr 15, 2020

Amending docs and CLI (for #270)

f6ea9a1

lucventurini closed this as completed in b552dfc Apr 15, 2020

lucventurini reopened this Apr 15, 2020

lucventurini added a commit that referenced this issue Apr 16, 2020

Now mikado prepare will always remove identical, non-reference duplic…

657b9c7

…ates (#270)

lucventurini mentioned this issue Oct 7, 2020

Issue 270 #326

Closed

lucventurini closed this as completed in 9b04246 Oct 7, 2020

lucventurini reopened this Oct 7, 2020

lucventurini closed this as completed in 818a601 Oct 7, 2020

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Update the changelog and version; fix EI-CoreBioinformatics#270 and E…

840aeee

…I-CoreBioinformatics#291. Ready to roll into Conda etc.

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Fix the parsing of the list.txt file (for EI-CoreBioinformatics#270)

3537e81

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Amending docs and CLI (for EI-CoreBioinformatics#270)

e45471c

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Fix EI-CoreBioinformatics#270 with changes asked by @swarbred.

92a7d18

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Now mikado prepare will always remove identical, non-reference duplic…

a12b3c3

…ates (EI-CoreBioinformatics#270)

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Merge issue-270 and fix EI-CoreBioinformatics#270

44c452a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add prepare option to remove transcripts with the same intron chain #270

Add prepare option to remove transcripts with the same intron chain #270

swarbred commented Feb 15, 2020

lucventurini commented Feb 15, 2020

lucventurini commented Mar 21, 2020

lucventurini commented Mar 21, 2020

lucventurini commented Mar 24, 2020

lucventurini commented Mar 25, 2020 •

edited

lucventurini commented Apr 7, 2020

swarbred commented Apr 7, 2020

lucventurini commented Apr 7, 2020

swarbred commented Apr 7, 2020

swarbred commented Apr 15, 2020

lucventurini commented Apr 15, 2020 •

edited

lucventurini commented Apr 15, 2020

lucventurini commented Apr 15, 2020 •

edited

swarbred commented Apr 15, 2020 •

edited

swarbred commented Apr 15, 2020

swarbred commented Apr 15, 2020

lucventurini commented Apr 15, 2020

lucventurini commented Apr 16, 2020

swarbred commented Apr 16, 2020

swarbred commented Sep 25, 2020

lucventurini commented Oct 1, 2020

swarbred commented Oct 1, 2020

lucventurini commented Oct 2, 2020

lucventurini commented Oct 5, 2020

swarbred commented Oct 7, 2020

lucventurini commented Oct 7, 2020

Add prepare option to remove transcripts with the same intron chain #270

Add prepare option to remove transcripts with the same intron chain #270

Comments

swarbred commented Feb 15, 2020

lucventurini commented Feb 15, 2020

lucventurini commented Mar 21, 2020

lucventurini commented Mar 21, 2020

lucventurini commented Mar 24, 2020

lucventurini commented Mar 25, 2020 • edited

lucventurini commented Apr 7, 2020

swarbred commented Apr 7, 2020

lucventurini commented Apr 7, 2020

swarbred commented Apr 7, 2020

swarbred commented Apr 15, 2020

lucventurini commented Apr 15, 2020 • edited

lucventurini commented Apr 15, 2020

lucventurini commented Apr 15, 2020 • edited

swarbred commented Apr 15, 2020 • edited

swarbred commented Apr 15, 2020

swarbred commented Apr 15, 2020

lucventurini commented Apr 15, 2020

lucventurini commented Apr 16, 2020

swarbred commented Apr 16, 2020

swarbred commented Sep 25, 2020

lucventurini commented Oct 1, 2020

swarbred commented Oct 1, 2020

lucventurini commented Oct 2, 2020

lucventurini commented Oct 5, 2020

swarbred commented Oct 7, 2020

lucventurini commented Oct 7, 2020

lucventurini commented Mar 25, 2020 •

edited

lucventurini commented Apr 15, 2020 •

edited

lucventurini commented Apr 15, 2020 •

edited

swarbred commented Apr 15, 2020 •

edited