Fixed bugs and enhanced isoform quantification #181

Closed
wants to merge 8 commits into from


@ctl43 ctl43 commented Oct 26, 2021

Thanks for creating this wonderful package. While using the program, I found a few problems and made some updates that fix bugs and add a potentially better isoform selection logic for when the stringent option is on.

Minor things:

  1. FLAIR was unable to handle soft-masked reference genomes. This is now fixed.
    See issue (Unable to handle soft-masked reference genome #168).

  2. All intermediate files are now retained for troubleshooting when keep_intermediate is on in flair.py, including args.o+'firstpass.fa' and alignout+'q.counts'.

  3. Fixed the mm2_args problem properly rather than just removing the argument; see issue (Flair collapse stopping without error #175).

Major bugs:

  1. The minimum supporting reads argument of the collapse function was not passed to collapse_isoforms_precise.py, so some TSSs with sufficient support were also filtered out. Currently, collapse_isoforms_precise.py only chooses TSSs supported by 25% of reads. This value also differs from the one in its help text, which describes the option as the minimum proportion (s<1) or number of supporting reads (s>=1) for an isoform, with a default of 0.1.

  2. When a splice donor and a splice acceptor are at the same site, the ssData object in ssPrep.py stored only the one that appeared first. Although this rarely happens, it does occur in real annotations (e.g. TMEM141 in GENCODE), and it caused otherwise promising isoforms to be filtered out. This was fixed by storing the exon-end information in two objects, ssData_1 and ssData_2.

  3. When the stringent argument is on, only reads that cover 25 bp of the first and last exon should be selected. However, count_sam_transcripts.py does not appear to consider strand when determining the first blocksize and last_blocksize. If a gene is on the negative strand, the exon on the right side is the first exon and the exon on the left side is the last exon. The script nevertheless treated the leftmost exon as the first exon and the rightmost exon as the last exon regardless of strand. Consequently, the stringent classifier function, is_stringent, did not work properly, and reads not covering 25 bp of the first and last exon also passed the classifier.
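To illustrate the strand issue in major bug 3, here is a minimal sketch of choosing the first and last block sizes in transcript orientation. The function name `terminal_block_sizes` and its inputs are hypothetical and are not taken from count_sam_transcripts.py; the sketch only assumes that block sizes are listed in genomic (left to right) order, as in a BED12 record.

```python
def terminal_block_sizes(block_sizes, strand):
    """Return (first_exon_size, last_exon_size) in transcript orientation.

    block_sizes: exon lengths ordered by genomic coordinate (left to right).
    strand: '+' or '-'.
    Hypothetical helper for illustration; not FLAIR's actual code.
    """
    if not block_sizes:
        raise ValueError("block_sizes must not be empty")
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")

    left, right = block_sizes[0], block_sizes[-1]
    # On the minus strand, the rightmost genomic block is the first exon
    # of the transcript and the leftmost block is the last exon.
    if strand == "+":
        return left, right
    return right, left


# A three-exon transcript with exon lengths listed left to right.
print(terminal_block_sizes([100, 50, 30], "+"))
print(terminal_block_sizes([100, 50, 30], "-"))
```

With the same block list, the two strands give the first and last exon sizes in opposite order, which is the distinction the stringent check needs.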

Enhancement

Personally, I found that transcript quantification does not work very well with either FLAIR's own quantification or salmon. Salmon needs specific arguments when handling ONT sequencing reads; see the issue on the salmon GitHub, COMBINE-lab/salmon#602. For FLAIR's own quantification, when the stringent argument is on in flair.py, once samtools has filtered reads by mapping quality, count_sam_transcripts.py no longer takes mapping quality into account when finding the best alignment (even if one alignment is far better than the chosen one).

The culprit could be the metric it uses. An alignment is considered better when it is closer to the transcript end and has more nucleotide matches. However, this metric may fail to choose the best alignment when a read has more matches but poor percentage identity (PID) in the aligned region. Considering the PID in the aligned region as well as the mapping quality may give a better transcript assignment for reads.
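As a rough illustration of the PID idea, the sketch below computes percentage identity over the aligned region from a CIGAR string and the NM (edit distance) tag. It assumes PID is defined as matching bases divided by alignment columns (aligned bases plus insertions plus deletions), and that NM counts mismatches plus inserted and deleted bases. The function name `percent_identity` is hypothetical and this is not necessarily how the pull request computes it.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def percent_identity(cigar, nm):
    """Percentage identity over the aligned region (illustrative sketch).

    cigar: SAM CIGAR string, e.g. '10S90M5I5D'.
    nm: value of the NM tag (mismatches + inserted + deleted bases).
    Soft/hard clips and skipped regions (S, H, N, P) are ignored.
    """
    aligned = insertions = deletions = 0
    for length, op in _CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            aligned += n
        elif op == "I":
            insertions += n
        elif op == "D":
            deletions += n

    columns = aligned + insertions + deletions
    if columns == 0:
        return 0.0
    mismatches = nm - insertions - deletions
    matches = aligned - mismatches
    return 100.0 * matches / columns


# More matches does not imply higher identity:
long_noisy = percent_identity("1000M", 100)   # 900 matches
short_clean = percent_identity("870M", 20)    # 850 matches
print(long_noisy, short_clean)
```

In this made-up example the first alignment has more matching bases (900 versus 850) but a lower identity, so a match-count metric and a PID metric would rank the two alignments differently.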

Furthermore, it may be beneficial to have two kinds of mapping quality filter: a hard filter, like the one FLAIR currently uses, and a soft filter. When the stringent argument is on, reads that pass the soft mapping quality filter and have the best MAPQ are considered as candidates for stringent classification. On the other hand, if no reads pass the soft mapping quality filter, their mapping qualities are treated as equally good, and other metrics, as previously mentioned (e.g. PID and the distance to the transcript end), are used to determine the best transcript assignment.
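The two-filter idea could look roughly like the sketch below. This is one possible reading of the proposal, not the code in this pull request: the `Alignment` record, the `pick_candidates` function, and the threshold values `hard_mapq=1` and `soft_mapq=10` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One alignment of a read to a transcript (hypothetical record)."""
    transcript: str
    mapq: int
    pid: float          # percentage identity in the aligned region
    end_distance: int   # distance to the transcript end, smaller is better


def pick_candidates(alignments, hard_mapq=1, soft_mapq=10):
    """Order candidate alignments of a single read (illustrative sketch).

    1. Hard filter: drop alignments with MAPQ below hard_mapq.
    2. Soft filter: if any alignment reaches soft_mapq, keep only those
       tied for the best MAPQ.
    3. Otherwise treat MAPQ as uninformative and rank the remaining
       alignments by PID (descending), then distance to the transcript end.
    """
    kept = [a for a in alignments if a.mapq >= hard_mapq]
    if not kept:
        return []

    passing = [a for a in kept if a.mapq >= soft_mapq]
    if passing:
        best = max(a.mapq for a in passing)
        return [a for a in passing if a.mapq == best]

    return sorted(kept, key=lambda a: (-a.pid, a.end_distance))


alns = [
    Alignment("t1", 60, 95.0, 10),
    Alignment("t2", 30, 99.0, 0),
    Alignment("t3", 60, 90.0, 5),
]
print([a.transcript for a in pick_candidates(alns)])
```

The candidates returned in the first branch would then go through the stringent classification; in the fallback branch the first element of the sorted list would be the preferred assignment.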

The abovementioned bugs and enhancements have been implemented in this pull request.

@Jeltje
Collaborator

Jeltje commented Jul 12, 2022

Finally getting around to this. A lot of changes have been made to the code since, so I'm implementing your solutions instead of merging the code. Thanks so much for doing this.

About your major bug 1 (the minimum supporting reads argument of the collapse function not being passed to collapse_isoforms_precise.py): it's not actually meant to be passed. The collapse function parameter --support looks at the reads supporting the final isoform set, while collapse_isoforms_precise.py uses its --support parameter to look at reads that support a TSS or TES. The identical naming is unfortunate and I hope to change it for clarity.

@ctl43
Author

ctl43 commented Jul 13, 2022

Thanks for maintaining FLAIR. I am thrilled to see that it is alive again! As the bugs are fixed, I will close this pull request.

@ctl43 ctl43 closed this Jul 13, 2022