Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add prepare option to remove transcripts with the same intron chain #270

Closed
swarbred opened this issue Feb 15, 2020 · 27 comments
Closed

Add prepare option to remove transcripts with the same intron chain #270

swarbred opened this issue Feb 15, 2020 · 27 comments

Comments

@swarbred
Copy link
Collaborator

@lucventurini for possible future consideration (post 2.0)

With us now starting to use CSS reads or raw pacbio / nanopore alignments in mikado the number of reads is getting much larger.

In the context of the transcriptome module I've discussed with @ljyanesm some pragmatic solutions (e.g. stringtie2 to fillter and collapse the data, downside being this assembles reads OR collapsing with GFFread pre or post mikado).

Obviously this can be done outside of mikado but I wanted to check if it was easily feasible to do it as part of the prepare step.

The idea would be to remove transcript with identical intron chains (keeping I guess the version with the most extended terminal exons, e.g. as gffread -M), but NOT removing contained fragments of partial intron chains i.e. as gffread -M -K

That would not be advisable for illumina RNA-Seq assembled transcripts but it's less dangerous for long reads and would substantially reduce the number of transcripts for orf / blast analysis.

Ultimately there are better solutions to filter long reads prior to mikado but the above would be a useful feature (if you already hold the info required during prepare).

Note this would need to be tied to specific labels to allow this to be applied to just certain transcript sets.

@lucventurini
Copy link
Collaborator

Hi @swarbred ,
this should not be too hard to accomplish, actually. It would probably require only some fiddling around the store_transcripts function.
It is a good idea, I think. I can definitely start experimenting on this after 2.0.

lucventurini added a commit that referenced this issue Mar 21, 2020
…d on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.
@lucventurini
Copy link
Collaborator

@swarbred @ljyanesm

872dcff has the required changes. As stated in the commit message, now Mikado will remove redundant transcripts based on their intron chains / monoexonic overlap, rather than checking simply for the start/end. Exceptions; two transcripts with similar coordinates but different strand or CDS content.
Needs to be properly tested, but I think that 872dcff properly implements the required changes.

@lucventurini
Copy link
Collaborator

Note this would need to be tied to specific labels to allow this to be applied to just certain transcript sets.

@swarbred apologies, the changes in 872dcff do not distinguish between datasets. Still WIP

lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 24, 2020
lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 24, 2020
@lucventurini
Copy link
Collaborator

@swarbred @ljyanesm

The current commit (c563f37) has a functioning prototype of this function.
Currently missing: being able to specify which samples should see themselves pared down (and/or how). We should discuss this tomorrow in the meeting.

lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 25, 2020
@lucventurini
Copy link
Collaborator

lucventurini commented Mar 25, 2020

Hi @swarbred

Commit bc0774e should add all the features that you request.
Specifically now the list of files will have the following format:

filename label strandedness score is_reference keep_redundant

The first three fields are mandatory.
Score can be omitted (default 0).
is_reference must be boolean. Default: false.
keep_redundant must be boolean. A reference transcript will have this set to True by default.

A transcript will be considered as redundant to a template if:

  • it is completely contained within the template
  • its intron chain is identical to the template (for multiexonic transcripts)
  • its CDS (if present for both) is identical to the template.
  • its strand is identical to the template.

Please note that the last point implies that some transcripts will slip through: the check on the strandedness for multiexonic transcripts happens downstream to the redundancy test (as it is a quite expensive operation).

lucventurini added a commit to lucventurini/mikado that referenced this issue Apr 7, 2020
* For EI-CoreBioinformatics#270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.

* For EI-CoreBioinformatics#270: this commit should introduce all the features asked by @swarbred (tagging also @ljyanesm)
lucventurini added a commit that referenced this issue Apr 7, 2020
* Issue #280: now mikado serialise has been refactored so that:
   * loading of BLAST XMLs should be faster thanks to using Cython on the most time-expensive function
   * mikado now accepts also *tabular* BLAST data (custom format, we need the `ppos` and `btop` extra fields)
   * `daijin` now automatically generates *tabular* rather than XML BLAST results
   * `mikado` will now use `spawn` as the default multiprocessing method. This avoids memory accounting problems in eg. SLURM (sometimes `fork` results in the HPC management system to think that the shared memory is duplicated, massively and falsely inflating the accounting of memory usage).
* Issue #270: now Mikado will remove redundancy based on intron chains
  * For #270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.
@lucventurini
Copy link
Collaborator

Fixed in b3f3d7a

@swarbred
Copy link
Collaborator Author

swarbred commented Apr 7, 2020

@lucventurini

the mikado configure --help

needs to be updated with the additional column for keep_redundant

--list LIST           Tab-delimited file containing rows with the following format
                                                <file>  <label> <strandedness> <score(optional)> <always_keep(optional)>

@lucventurini
Copy link
Collaborator

True. I thought I had done already, will do so tomorrow.

@lucventurini lucventurini reopened this Apr 7, 2020
@swarbred
Copy link
Collaborator Author

swarbred commented Apr 7, 2020

This was from the --help from 4ab8cf5

@swarbred
Copy link
Collaborator Author

Hi @lucventurini

Sorry I'm slightly confused, when I read this originally I thought you had added an extra column to the list file "keep_redundant" that was a boolean for whether to collapse transcripts contained in another model with the default being TRUE i.e. only collapsing exact matching transcripts. Looking at the the prepare --help description for list this indicates

<score(optional)> <is_reference(optional)><always_keep(optional)

the help from an earlier version before this change

<score(optional)> <always_keep(optional)>

so the new column is "is_reference" as we had always_keep in the earlier version

What you called "keep_redundant" looks to map to <always_keep(optional)>

if so it seems we have switched the meaning of always_keep in the earlier version to now be is_reference.

is_reference = flag which models are to be regarded as reference transcripts (this seems clear from the name, and I prefer using this as a column name than always_keep)

always_keep = I guess if you take always_keep to be short for "always keep contained" then it makes sense what confused me is that we seem to have switched always_keep to be is_reference. I understand what matters is the column order (and the marking of reference transcripts is still col 5 across versions).

As long as the doc makes clear the reason for the different columns and that the collapsing of contained models is separate from the marking of reference transcripts with the default NOT to collapse contained models then this seems fine.

@lucventurini
Copy link
Collaborator

lucventurini commented Apr 15, 2020

@swarbred

You are right that the documentation is confusing. This will be the main target for modifications before releasing 2.0 ...

But onto how it should work:

Sorry I'm slightly confused, when I read this originally I thought you had added an extra column to the list file "keep_redundant" that was a boolean for whether to collapse transcripts contained in another model with the default being TRUE i.e. only collapsing exact matching transcripts. Looking at the the prepare --help description for list this indicates

Slightly incorrect. The choice is between either not collapsing (keep_redundant: True) or collapsing matching, contained intron chains (keep_redundant: False). The default in the RC is False ie remove redundancies. That's why the command line switch now is --keep-redundant.

What you called "keep_redundant" looks to map to <always_keep(optional)>
if so it seems we have switched the meaning of always_keep in the earlier version to now be is_reference.
is_reference = flag which models are to be regarded as reference transcripts (this seems clear from the name, and I prefer using this as a column name than always_keep)

Yes, that is correct. Internally I still use the "reference" moniker; I will amend accordingly.

Moreover, I just realised that unfortunately the parsing of the input list of files (either in configure or prepare was broken).

As long as the doc makes clear the reason for the different columns and that the collapsing of contained models is separate from the marking of reference transcripts with the default NOT to collapse contained models then this seems fine.

Yes, we will amend the documentation. I have also changed things so that the default now is to not reduce redundancy unless specified.

Please see commits e57aa8a and f6ea9a1.

@lucventurini
Copy link
Collaborator

@swarbred

My reference to "TRUE i.e. only collapsing exact matching transcripts" was that we would not collapse the contained intron chains but just apply the normal removing of exact matching transcripts.

To clarify: the option of removing exactly matching transcripts is not present. Either all transcripts are left in or we remove redundant intron chains. My fault, I understood that we wanted to supersede the behaviour, rather than having three possible options instead of two (keep everything, remove identical transcripts, remove redundant intron chains).

I don't think we should be as default collapsing based on intron chain given that we have not really tested this and there is the potential for this to have a negative effect (i.e. single exon genes adjacent to multi exonic genes being merged together in an assembled transcript).

This will not happen in any circumstance. Mikado removes transcripts that have the same intron chain. So if a transcript with n - 1 introns is completely contained within a transcript that has n introns, both will be kept.

So in a run with mikado-2.0pcr1 where I have not populated this 6th column in the list file i was running with a default of False i.e. I was collapsing all transcripts that were contained?

Yes, I fear so. Unfortunately, and very embarassingly for me, I realised this morning that the parsing of the list file was broken.

Seems safest at least until we start to make use of this and get a better idea about any downsides. So what does this mean for your command line option --keep-redundant if now the default is already to keep redundant transcripts. Also this only refers to the contained intron chains we are still removing transcripts where the coordinates exactly match.

The --keep-redundant switch would supersede anything in the configuration file. So if this is invoked, all files will have all transcripts kept in, regardless of what was written in the list of files provided to configure.

@lucventurini
Copy link
Collaborator

lucventurini commented Apr 15, 2020

Hopefully this will help clear things up:
Notes_200415_150743_ad6_1

In this locus:

  • A' is redundant with A and will be removed.
  • B is not redundant with A, or viceversa, because it is not fully contained.
  • C is not redundant with A, A' or B, because it has one intron less than any of these three.
  • C' is redundant with C and will be removed (for the same reasons as A' is redundant with A)
  • D is not redundant with any multiexonic transcript.
  • D' is redundant with D, because it is fully contained.
  • D and E are not redundant because neither is fully contained in the other

I hope this makes the logic a little bit clearer?

@swarbred
Copy link
Collaborator Author

swarbred commented Apr 15, 2020

My fault, I understood that we wanted to supersede the behaviour, rather than having three possible options instead of two (keep everything, remove identical transcripts, remove redundant intron chains).

This is what I was envisaging

no changes to is_reference i.e. any model marked as reference is retained (strictly we may have some exceptions i.e. where models fail some criteria i.e. CDS with internal stops, whatever we are currently doing seems fine).

Outside of reference models I cant see a reason why we would ever want to keep truly identical transcripts (there is no benefit in mikado pick if the models are really identical as far as I can see) so I wasn't thinking we would change this behaviour (i.e. I expected these always to be removed unless the model is a reference) and the user wouldn't need any control of this.

I was thinking we would just add a column to the list file to enable the control over removing redundant intron chains.

so the default behaviour would be as in previous versions i.e. we only remove identical models unless marked as reference but we would have the option to be more aggressive and remove redundant intron chains (again excluding models marked as reference).

This will not happen in any circumstance. Mikado removes transcripts that have the same intron chain. So if a transcript with n - 1 introns is completely contained within a transcript that has n introns, both will be kept.

My example would be like this, two adjacent genes, 1 single exon 1 multiexonic as shown in the REF annotation (i.e. what is the correct annotation). But the RNA assemblies generated look like the three transcript assemblies (in transcript 2, there is low coverage between the two genes which results in a single fused transcript). Transcript 2 and 3 are the same relationship as your A and A` example

REF annotation XXXXXX XXX---XXX---XXXX
Transcript 1 XXXXX
Transcript 2 XXXXXXXXXXX---XXX---XXXX
Transcript 3 XXX---XXX---XXXX

In this case I believe if you run with the new collapse intron chain logic we will end up with just two transcripts 1 and 2.

So we had the "correct" answer in the input but lose 1 of these correct transcripts in the prepare output if we collapse on intron chains. This would be one case where collapsing on intron chain would be detrimental (hence why I wouldn't do this as default for illumina assemblies).

The above is going to be less of an issue when you have full-length transcripts rather than assembled rna-seq.

@swarbred
Copy link
Collaborator Author

Sorry my attempt at the diagram gets messed up on posting, I will post the image

@swarbred
Copy link
Collaborator Author

Screenshot 2020-04-15 at 17 41 06

@lucventurini
Copy link
Collaborator

So we had the "correct" answer in the input but lose 1 of these correct transcripts in the prepare output if we collapse on intron chains. This would be one case where collapsing on intron chain would be detrimental (hence why I wouldn't do this as default for illumina assemblies).

I see your point. I would hope that this specific instance would be caught by chimera splitting, but that's just quibbling - I can see the reason to keep the behaviour as keeping redundancies.

so the default behaviour would be as in previous versions i.e. we only remove identical models unless marked as reference but we would have the option to be more aggressive and remove redundant intron chains (again excluding models marked as reference).

Yes, this is achievable. I will reopen the issue-270 branch ...

@lucventurini
Copy link
Collaborator

@swarbred

b552dfc and 657b9c7 should address this problem. Now redundant transcripts (ie, non-reference identical copies, down to the base and including a comparison of the CDS) are always de-duplicated by keeping only one copy (giving preference to reference transcripts first, to higher-scoring sources second).

@swarbred
Copy link
Collaborator Author

@lucventurini
Thanks we have few hpc issues currently but i will run on the test region when I can

@swarbred
Copy link
Collaborator Author

@lucventurini

I'm just coming back to this after some time (we had hpc issues at the time of original email and the 2.0prc2 version i ran didn't complete the prepare stage)

It looks like the modified behaviour is default on in 2.0prc2

e.g. 487445 seqs after prepare with d094f99

275429 seqs after prepare with 2.0prc2

my intention was to have this as an option for the special cases when you need to more aggressively remove models with the same intron chain i.e. very large data sets thinking mainly of pacbio CCS or ONT reads not as a default for the more common case of normal illumina assemblies.

It certainly cuts down the model numbers so speeds down stream steps but I suspect it is detrimental for illumina assemblies. Whether it's default on or off I want to know how to switch to the original behaviour.

Did we make a change to NOT have the new behaviour as default after the version i'm running or is this still the default?

To clarify is this how it currently works

with --keep-redundant TRUE , we remove only fully redundant models (i.e. match end to end) unless they are marked as reference i.e. always keep (i.e. the old behaviour)

with --keep-redundant FALSE , we remove models with redundant intron chains unless they are marked as reference (i.e. the new behaviour)

From the configration.toml file that was generated the default appears to be keep_redundant = false which would fit with the above.

@lucventurini
Copy link
Collaborator

Hi @swarbred

Yes, your understanding is exactly right.

I can switch back the behaviour so that the flag can be "--exclude-redundant" if that is the preferred case. It will require a bit of fiddling as some automated tests rely on the redundancy removal.

Best

@swarbred
Copy link
Collaborator Author

swarbred commented Oct 1, 2020

Hi @lucventurini when you have access see JIRA ticket GENANNO-480 and my comment from 28th sept 12:06 and the browser link which is shown in the comment below that.

There will be some cases where the more aggressive removal results in an improved selection but I think there are more times where this is undesirable (specifically dealing with illumina assemblies) e.g. the screen shot below where the more aggressive redundancy removal results in only the over extended model being selected (this is not stranded data). The top track is with --keep-redundant FALSE the second track with = TRUE

Screenshot 2020-09-28 at 15 07 33

So I would favour changing to --exclude-redundant

@lucventurini
Copy link
Collaborator

Hi @swarbred,

I am proceeding to change the interface as you requested. Hopefully I should be done by this afternoon.

As the only change will be the direction of the switch, not the behaviour itself, it should be possible to quickly test this and then merge into master?

lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 2, 2020
…ant' has become 'exclude_redundant' and the default behaviour is, like in Mikado 1.0, to keep redundant models unless specified otherwise. Also changed the minimum biopython version (1.78) and made small adjustments as the new library version has removed support for 'Alphabet'.
@lucventurini
Copy link
Collaborator

Hi @swarbred @gemygk

Travis is testing the latest commit in the branch (https://travis-ci.org/github/lucventurini/mikado/jobs/733038131) which should function for Python 3.6, 3.7 and 3.8.

I will sign it off here and leaving this commit, 2a61631, for you to test.

Please let me know if there is anything else I can do on this.

@swarbred
Copy link
Collaborator Author

swarbred commented Oct 7, 2020

@lucventurini can confirm 2a61631 works as expected

@lucventurini
Copy link
Collaborator

@swarbred

Excellent! Merging into master and closing!

@lucventurini lucventurini mentioned this issue Oct 7, 2020
@lucventurini lucventurini reopened this Oct 7, 2020
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
…ioinformatics#305)

* Issue EI-CoreBioinformatics#280: now mikado serialise has been refactored so that:
   * loading of BLAST XMLs should be faster thanks to using Cython on the most time-expensive function
   * mikado now accepts also *tabular* BLAST data (custom format, we need the `ppos` and `btop` extra fields)
   * `daijin` now automatically generates *tabular* rather than XML BLAST results
   * `mikado` will now use `spawn` as the default multiprocessing method. This avoids memory accounting problems in eg. SLURM (sometimes `fork` results in the HPC management system to think that the shared memory is duplicated, massively and falsely inflating the accounting of memory usage).
* Issue EI-CoreBioinformatics#270: now Mikado will remove redundancy based on intron chains
  * For EI-CoreBioinformatics#270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants