Fixed bugs and enhanced isoform quantification #181

ctl43 · 2021-10-26T05:36:35Z

Thanks for creating this wonderful package. While using the program, I found a few problems and made some updates that fix bugs and add a potentially better isoform selection logic for when the stringent option is on.

Minor things:

Flair was unable to handle a soft-masked reference genome; this is now fixed.
See issue (Unable to handle soft-masked reference genome #168).
All intermediate files are now retained for troubleshooting when keep_intermediate is on in flair.py, including args.o+'firstpass.fa' and alignout+'q.counts'.
Fixed the mm2_args problem properly rather than just removing the argument; see issue (Flair collapse stopping without error #175).
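
A soft-masked genome marks repeats with lowercase bases, so code that compares against uppercase `A/C/G/T` can misbehave. The sketch below shows one common way to handle this, by uppercasing sequence lines while reading the FASTA. The function name `read_fasta_uppercase` is hypothetical and this is not necessarily how the fix in this pull request is written.

```python
def read_fasta_uppercase(lines):
    """Yield (name, sequence) pairs from FASTA lines.

    Soft-masked (lowercase) bases are converted to uppercase so that
    downstream comparisons do not depend on the masking.
    Hypothetical helper, for illustration only.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)
```

Any iterable of lines works as input, for example an open file handle: `for name, seq in read_fasta_uppercase(open(path)): ...`.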

Major bugs:

The minimum-supporting-reads argument of the collapse function was not passed to collapse_isoforms_precise.py, so some TSSs with sufficient support were also filtered out. Currently, collapse_isoforms_precise.py only chooses TSSs supported by 25% of reads. This value also differs from the default in its help text, which states: minimum proportion (s<1) or number of supporting reads (s>=1) for an isoform (0.1).
When a splice donor and a splice acceptor are at the same site, the ssData object in ssPrep.py only stored the one that appeared first. Although this rarely happens, it does occur in real-life annotation (e.g. TMEM141 in GENCODE) and caused otherwise promising isoforms to be filtered out. This was fixed by storing the exon-end information in two objects, ssData_1 and ssData_2.
When the stringent argument is on, only reads that cover 25 bp of the first and last exon should be selected. However, count_sam_transcripts.py does not seem to consider strand information when determining the first blocksize and last_blocksize. If a gene is on the negative strand, the exon on the right side is the first exon and the exon on the left side is the last exon. The script nevertheless treated the leftmost exon as the first exon and the rightmost exon as the last exon regardless of strand. As a result, the stringent classifier function, is_stringent, does not work properly, and reads that do not cover 25 bp of the first and last exon also pass the classifier.
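
To illustrate the strand issue in the last item, here is a minimal sketch of strand-aware selection of the first and last exon block. It assumes block sizes are listed left to right in genomic coordinates, as in BED12/PSL; the function name `terminal_block_sizes` is hypothetical and is not taken from count_sam_transcripts.py.

```python
def terminal_block_sizes(block_sizes, strand):
    """Return (first_exon_size, last_exon_size) in transcript orientation.

    block_sizes: exon block sizes ordered left to right on the genome.
    strand: "+" or "-".
    Hypothetical helper, for illustration only.
    """
    if not block_sizes:
        raise ValueError("block_sizes must not be empty")
    if strand == "-":
        # On the negative strand the rightmost block is the first exon.
        return block_sizes[-1], block_sizes[0]
    return block_sizes[0], block_sizes[-1]
```

For block sizes `[100, 200, 300]`, the first exon is the 100 bp block on the positive strand but the 300 bp block on the negative strand.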

Enhancement

Personally, I found that transcript quantification does not work very well with either flair's own quantification or salmon. Salmon needs specific arguments when handling ONT sequencing reads; see the issue on the salmon GitHub, COMBINE-lab/salmon#602. For flair's own quantification, when the stringent argument is on in flair.py, once samtools has filtered reads by mapping quality, count_sam_transcripts.py no longer takes mapping quality into consideration when finding the best alignment (even if one alignment is far better than the chosen one).

The culprit could be the metric used. An alignment is considered better when it is closer to the transcript end and has more nucleotide matches. However, this metric may fail to choose the best alignment when a read has more matches but a poor percentage identity (PID) in the aligned region. Considering the PID in the aligned region as well as the mapping quality may give a better transcript assignment for reads.
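
As a rough illustration of PID in the aligned region, the sketch below computes identity over alignment columns from a CIGAR string and the SAM `NM` tag (edit distance to the reference). It assumes `NM` counts mismatches plus inserted and deleted bases, and it ignores soft/hard clips and introns (`N`). The function name is hypothetical and this is not necessarily the exact formula used in this pull request.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def percent_identity(cigar, nm):
    """Percentage identity over the aligned region (pairwise space).

    Alignment columns are M/=/X (aligned bases), I (insertions) and
    D (deletions). Clips (S/H), introns (N) and padding (P) are skipped.
    Hypothetical helper, for illustration only.
    """
    columns = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in "MID=X":
            columns += int(length)
    if columns == 0:
        return 0.0
    return 100.0 * (columns - nm) / columns
```

With this definition a long alignment with many mismatches scores lower than a shorter, cleaner one, which is the case the match-count metric misses.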

Furthermore, it may be beneficial to have two kinds of mapping quality filter: a hard filter, like the one flair currently uses, and a soft filter. When the stringent argument is on, reads that pass the soft mapping quality filter and have the best MAPQ are considered as candidates for stringent-criteria classification. If, on the other hand, no reads pass the soft filter, their mapping qualities are treated as equally good and other metrics, as mentioned previously (e.g. PID and the distance to the transcript end), are used to determine the best transcript assignment.
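
The selection logic described above could be sketched as follows. The `Alignment` fields, the function name `pick_candidates`, and the threshold defaults are all assumptions made for illustration; they are not the names or values used in flair.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """One alignment of a read to a transcript (hypothetical structure)."""
    transcript: str
    mapq: int
    pid: float          # percentage identity in the aligned region
    end_distance: int   # distance to the transcript end, smaller is better


def pick_candidates(alignments, hard_mapq=1, soft_mapq=10):
    """Apply a hard and a soft MAPQ filter to the alignments of one read.

    hard_mapq and soft_mapq are illustrative defaults, not flair's.
    """
    # Hard filter: alignments below this MAPQ are always discarded.
    kept = [a for a in alignments if a.mapq >= hard_mapq]
    if not kept:
        return []
    # Soft filter: if anything passes, keep only the best-MAPQ alignments
    # as candidates for the stringent classification.
    passing = [a for a in kept if a.mapq >= soft_mapq]
    if passing:
        best = max(a.mapq for a in passing)
        return [a for a in passing if a.mapq == best]
    # Nothing passes the soft filter: treat MAPQ as equally good and rank
    # by PID (higher first), then by distance to the transcript end.
    return sorted(kept, key=lambda a: (-a.pid, a.end_distance))
```

When the soft filter is passed, the result still needs the stringent check on first and last exon coverage; when it is not, the first element of the returned list is the preferred assignment.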

The abovementioned bugs and enhancements have been implemented in this pull request.


Jeltje · 2022-07-12T03:39:55Z

Finally getting around to this. A lot of changes have been made to the code since, so I'm implementing your solutions instead of merging the code. Thanks so much for doing this.

About your major bug 1 ("The argument of minimum supporting reads in the collapse function was not passed to collapse_isoforms_precise.py"): it's not actually meant to be passed. The collapse function parameter --support looks at the reads supporting the final isoform set, while collapse_isoforms_precise.py uses its --support parameter to look at reads that support a TSS or TES. The identical naming is unfortunate and I hope to change it for clarity.

ctl43 · 2022-07-13T03:36:17Z

Thanks for maintaining FLAIR. I am thrilled to see that it is alive again! As the bugs are fixed, I will close this pull request.

ctl43 added 8 commits September 28, 2021 18:15

Handling soft-masked reference sequences

0d5d18c

Fixing bugs and updated logic for selecting best transcript in stringent option

686a3eb

Separately storing splicing acceptor and donor information

d4ac2a2

Fixing bugs in stringent options and updating logic in choosing the best transcript

0da14cc

Added a soft mapping quality filter

b82fcf2

Implemented soft and hard quality cut off

ec0ce75

Fixed mm2_args problem but not just deleting it

8d62f20

Changed to use pairwise space for calculating PID

33ce5f8

ctl43 closed this Jul 13, 2022
