Fixed bugs and enhanced isoform quantification #181
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Thanks for creating this wonderful package. While I used the program, I found few problems and made some updates for fixing bugs as well as a potentially better isoform choosing logic when
stringent
option is on.Mirror things:
Currently, Flair is unable to handle soft-masked reference genome and it is fixed.
See issue (Unable to handle soft-masked reference genome #168).
Retained all intermediate files for troubleshoot when keep_intermediate is on in
flair.py
, including args.o+'firstpass.fa', alignout+'q.counts'.Fixed mm2_args problem but not just removing the argument, see issue (Flair collapse stopping without error #175).
Major bugs:
The argument of minimum supporting reads in the
collapse
function was not passed tocollapse_isoforms_precise.py
, so some TSSs with sufficient support were also filtered. Currently,collapse_isoforms_precise.py
will only choose TSS supported by 25% of reads. This value is also not the same as the value in its help manual, which statedminimum proportion(s<1) or number of supporting reads(s>=1) for an isoform (0.1)
.When splicing donor and acceptor are at the same site, object
ssData
inssPrep.py
only stored the one that appeared first. Although this rarely happens, it does appear in real-life annotation (e.g. TMEM141 in gencode) and filtered those promising isoforms. This was fixed by storing the information of exon ends into two objects,ssData_1
andssData_2
.When
stringent
argument is on, it will only select reads that cover 25 bp of the first and last exon. However, incount_sam_transcripts.py
, it seems that it forgot to consider the strand information when determining thefirst blocksize
andlast_blocksize
. If a gene is negative-stranded, the exon on the right side is the first exon, while the exon on the left side is the first exon. However, the script treated all exons on the left as the first exons and exons on the right as the last exons regardless of their strand. Unsurprisingly, the stringent classifier function,is_stringent
will not work properly. As a result, reads not covering 25 bp of the first and last exon also passed the classifier.Enhancement
Personally found that the transcript quantification does not work very well in both flair own quantification and salmon. For salmon, it needs specific arguments when handling ONT sequencing reads see the issue in salmon github, COMBINE-lab/salmon#602. For flair-own quantification, when the stringent argument is on, in
flair.py
, aftersamtools
processed reads filtered reads based on mapping quality,count_sam_transcripts.py
will no longer take mapping quality into consideration for finding the best alignment (Even one alignment is far better than the chosen one).The culprit could be the metrics it used. When an alignment is closer to the transcript end and possesses more nucleotide matches, it will be considered as a better one. However, this metric may fail to choose the best one when a read has more matches but percentage identity (PID) in the aligned region is poor. If we consider the PID in the aligned region as well as mapping quality, it may help to have a better transcript assignment for reads.
Furthermore, it may be beneficial if we create two kinds of filters for mapping quality, one is a hard mapping quality filter, like what flair currently using, another one is a soft mapping quality filter. When the stringent argument is on, reads passing the soft mapping quality and the one with the best mapping mapq will be considered as candidates for stringent criteria classification. On the other hand, no reads passing the soft mapping quality, their mapping quality will be considered as equally good and other metrics will be used to determine the best transcript assignment, as previous mentioned, e.g. PID and the distance to the transcript end.
The abovementioned bugs and enhancements have been implemented in this pull request.