How to judge CT/CTOT/CTOB/OB by FLAG, XR and XG in sam/bam file? #455
The whole strand and FLAG business is quite confusing; it also catches me out on many occasions! The reason for this is that the SAM format wasn't made to handle 4 strands of DNA, but assumes that there is one forward strand and one reverse strand (which is the reverse complement). So for non-directional FLAG values in Bismark, we made the decision many years ago that both the OT and CTOT strands should appear as forward reads in genome browsers (e.g. Seqmonk or IGV), and both OB and CTOB strands should appear as reverse reads. Before then we used to have different FLAG values, but I think we changed this some 6-8 years back:
I hope this is the information you were looking for? |
I think there were some tools that would complain if the old FLAGs were used, so we agreed to change them to the current state. This was pre-Github, but the conversation might still be somewhere on Seqanswers.com? If we take a read pair for OT and CTOT as an example. OT: Read 1:
Read 2:
I think the FLAGs are accurate for OT. CTOT: Read 1:
Read 1 aligns in reverse orientation against the converted top strand sequence. As above, this reverse orientation is not the same as the original reverse strand of the input DNA, but is an artificial strand that only exists in bisulfite sequencing (where converted top and bottom strand sequences are no longer complementary; there are exceptions, of course). Since both OT and CTOT sequences are informative for the (original) top/forward strand of the input DNA, we made the decision to swap the FLAGs of R1 and R2 round so that both OT and CTOT alignments look like they are top strand reads (facing in the forward direction). Read 2:
Again, the read identity has been swapped from 2 to 1 (First in Pair). I appreciate that this is all quite confusing and arbitrary (which it actually is), but we had to settle for some sort of kludge to make the SAM format work in situations it is not designed for. TL;DR:
|
My apologies, now I gave you an explanation for the old FLAG definition... Argh, I have now tried to update the above comment to be correct. As a general rule, in the BAM output the reads which are read in from the Read 1 FastQ file are always reported on the first line, and reads from the Read 2 FastQ file are always on line 2. So the reads and their FLAG values are reported like so: OT:
CTOT:
So the table for
For all downstream tools in the Bismark package, I determine the strand identity using solely the read and genome conversion states. The FLAG values are really just cosmetic (for displaying in genome browsers) in this regard.
if ($read_conversion_1 eq 'CT' and $genome_conversion eq 'CT'){
    $index = 0; ## this is OT (original top strand)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'GA'){
    $index = 1; ## this is CTOB (complementary to OB)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'CT'){
    $index = 2; ## this is CTOT (complementary to OT)
}
elsif ($read_conversion_1 eq 'CT' and $genome_conversion eq 'GA'){
    $index = 3; ## this is OB (original bottom)
}
I hope I got everything right this time... |
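For anyone working directly on SAM text rather than inside the Bismark scripts, the same decision can be sketched in Python. This is only an illustration of the rule stated above, not code from Bismark: the helper names `get_tag` and `strand_of_read1` are made up here, and the example record is invented. It assumes the line passed in is the Read 1 record of a pair and that it carries the XR:Z and XG:Z tags.

```python
# (read 1 conversion, genome conversion) -> strand, mirroring the Perl branches above
STRAND = {
    ("CT", "CT"): "OT",    # original top strand
    ("GA", "GA"): "CTOB",  # complementary to OB
    ("GA", "CT"): "CTOT",  # complementary to OT
    ("CT", "GA"): "OB",    # original bottom strand
}


def get_tag(fields, name):
    """Return the value of a string tag such as XR:Z:CT, or None if absent."""
    prefix = name + ":Z:"
    for field in fields[11:]:  # optional fields start at column 12 of a SAM record
        if field.startswith(prefix):
            return field[len(prefix):]
    return None


def strand_of_read1(sam_line):
    """Classify the alignment strand from the XR/XG tags of a Read 1 record."""
    fields = sam_line.rstrip("\n").split("\t")
    return STRAND[(get_tag(fields, "XR"), get_tag(fields, "XG"))]


# Invented record, for illustration only
line = "r1\t99\tchr1\t100\t42\t4M\t=\t150\t54\tTTGA\tIIII\tNM:i:0\tXR:Z:GA\tXG:Z:CT"
print(strand_of_read1(line))  # CTOT
```

Note that the FLAG column is never consulted, in line with the point that the FLAG values are cosmetic for this purpose.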
Thanks, 1. I understand this rule: in the BAM output, the reads which are read in from the Read 1 FastQ file are always reported on the first line, and Read 2 FastQ file reads are always on line 2. Does this mean that I can't use samtools sort -n to sort the file by query name? Reads with the same name would then be ordered according to the values of the READ1 and READ2 flags, and doing that would mix up the strand information.
thanks again |
That might be right. The vanilla Bismark output will follow these rules, and thus allow discrimination of the strand identity: OT
CTOT
If you want to use tools downstream that mix up the ordering, you might struggle to get the order back. If you really have to do this for some reason, you could post-process the output to add an additional optional strand tag to each alignment (or, if you wanted, I could add such a tag as an option to each BAM line?). Regarding 2): yes indeed. The conversion of the first read in the BAM file will, together with the genome conversion, dictate the alignment strand. |
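The post-processing idea mentioned above could look roughly like the following Python sketch. It is an assumption-laden illustration rather than anything Bismark ships: the function name `add_strand_tags` and the tag name `YS` are hypothetical (the SAM specification reserves tags starting with X, Y or Z for local use, so a real tag may well be named differently). It assumes untouched paired-end Bismark output converted to SAM text, where each Read 1 record is immediately followed by its Read 2 record.

```python
# (read 1 conversion, genome conversion) -> strand
STRAND = {("CT", "CT"): "OT", ("GA", "GA"): "CTOB",
          ("GA", "CT"): "CTOT", ("CT", "GA"): "OB"}


def tag_value(fields, name):
    """Return the value of a Z-type optional tag, or None if it is missing."""
    prefix = name + ":Z:"
    return next((f[len(prefix):] for f in fields[11:] if f.startswith(prefix)), None)


def add_strand_tags(sam_lines, tag="YS"):
    """Append a strand tag to both mates, decided by the Read 1 record alone.

    `tag` is a hypothetical name chosen for this sketch.
    """
    strand = None
    expect_read1 = True  # records alternate Read 1, Read 2 in unsorted output
    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        if expect_read1:
            fields = line.split("\t")
            strand = STRAND[(tag_value(fields, "XR"), tag_value(fields, "XG"))]
        yield f"{line}\t{tag}:Z:{strand}"
        expect_read1 = not expect_read1


# Invented pair, for illustration only
records = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t99\tchr1\t100\t42\t4M\t=\t150\t54\tTTGA\tIIII\tXR:Z:GA\tXG:Z:CT",
    "r1\t147\tchr1\t150\t42\t4M\t=\t100\t-54\tTTGA\tIIII\tXR:Z:CT\tXG:Z:CT",
]
tagged = list(add_strand_tags(records))
```

Once every record carries its own strand label, later re-sorting no longer destroys the information, because nothing depends on which mate happens to come first.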
Thanks Felix, CTOT OB CTOB
I've seen the brilliant code. It's very helpful for students like me, and I will study it carefully. Best wishes |
Regarding 1): I don't think adding 1/2 at the end of the queryname is a good idea, as the SAM format requires the QNAME to be the same for several reads of the same source (i.e. paired-end reads). Instead, I would suggest an optional tag, e.g. Regarding 2): Can you let me know the location of 'the official website' you think needs correcting? Cheers, Felix |
Thanks, 2.the official website: babraham bioinformatics(maybe older): Excellent job! Best wishes |
I have now tried to add a new option |
Yep, it works well. The default status of strandID is ON, and now it's easy for me to do downstream analysis. Best wishes |
Alright, it is now optional as intended. And you are right, other downstream scripts like And in all fairness, you would need to run non-directional, paired-end alignments, then probably sort by coordinates, index, filter, re-sort by query name, and then try to feed the resulting files back into the downstream scripts to run into this issue (which explains why this has never come up in some eleven or so years of Bismark :) )... If you really needed to be doing this, I would suggest you keep a record of the filtered readIDs, and then sub-set the original BAM file to keep only the reads you want - in the form that Bismark downstream tools would happily accept. Or probably even better, just:
I hope this is acceptable? |
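The suggestion of keeping a record of the filtered read IDs and sub-setting the original file can be sketched as below. This is a minimal Python illustration on SAM text, not a Bismark tool: the function name `subset_by_read_id` is hypothetical and the records are invented. The point is that the original record order (Read 1 directly followed by Read 2) is left untouched, so the result stays in the form the downstream scripts expect.

```python
def subset_by_read_id(sam_lines, keep_ids):
    """Yield header lines plus the records whose QNAME is in keep_ids.

    Records come out in their original order, so mates stay adjacent.
    """
    keep = set(keep_ids)
    for line in sam_lines:
        if line.startswith("@") or line.split("\t", 1)[0] in keep:
            yield line


# Invented records, for illustration only
original = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t99\tchr1\t100\t42\t4M\t=\t150\t54\tTTGA\tIIII",
    "r1\t147\tchr1\t150\t42\t4M\t=\t100\t-54\tTTGA\tIIII",
    "r2\t99\tchr1\t300\t42\t4M\t=\t350\t54\tTTGA\tIIII",
    "r2\t147\tchr1\t350\t42\t4M\t=\t300\t-54\tTTGA\tIIII",
]
kept = list(subset_by_read_id(original, {"r2"}))
```

The set of IDs would come from whatever filtering was done on the sorted copy; only the names are carried back to the original file.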
Yep, I’m truly grateful for your help! |
Hi Felix,
I am using Bismark to process non-directional data, but I don't know how to tell OT/CTOT/CTOB/OB apart.
Logically, I think:
OT read1 with XR:Z:CT XG:Z:CT and read2 with XR:Z:GA XG:Z:CT
CTOT read1 with XR:Z:GA XG:Z:CT and read2 with XR:Z:CT XG:Z:CT
CTOB read1 with XR:Z:GA XG:Z:GA and read2 with XR:Z:CT XG:Z:GA
OB read1 with XR:Z:CT XG:Z:GA and read2 with XR:Z:GA XG:Z:GA
and I read the Bismark script to confirm this.
However, I found that the FLAG values of Read 1 and Read 2 are swapped round (according to the annotation in the script), which is confusing to me.
So I asked for help.
Thanks!
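When comparing FLAG values between Read 1 and Read 2, it can help to decode the individual bits. The sketch below uses the bit meanings from the SAM specification (0x1 paired, 0x10 reverse strand, 0x40 first in pair, 0x80 second in pair); the function name `decode_flag` is made up for this example, and the two values shown are common paired-end FLAGs used only as sample input, not a statement about which FLAG Bismark assigns to which strand.

```python
# Subset of the SAM FLAG bits relevant to read identity and orientation
FLAG_BITS = {
    0x1: "paired",
    0x10: "reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag):
    """Return the names of the selected bits that are set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))   # ['paired', 'first_in_pair']
print(decode_flag(147))  # ['paired', 'reverse', 'second_in_pair']
```

Other bits (for example 0x2, proper pair, and 0x20, mate reverse strand) are left out to keep the output short.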