Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to judge CT/CTOT/CTOB/OB by FLAG, XR and XG in sam/bam file? #455

Closed
CristEin opened this issue Sep 2, 2021 · 14 comments
Closed

How to judge CT/CTOT/CTOB/OB by FLAG, XR and XG in sam/bam file? #455

CristEin opened this issue Sep 2, 2021 · 14 comments

Comments

@CristEin
Copy link

CristEin commented Sep 2, 2021

Hi Felix,
it's using bismark to handle the non-directional data, but I don't know how to judge CT/CTOT/CTOB/OB.
Logically, I think :
CT read1 with XR:Z:CT XG:Z:CT and read2 with XR:Z:GA XG:Z:CT
CTOT read1 with XR:Z:GA XG:Z:CT and read2 with XR:Z:CT XG:Z:CT
CTOB read1 with XR:Z:GA XG:Z:GA and read2 with XR:Z:CT XG:Z:GA
OB read1 with XR:Z:CT XG:Z:GA and read2 with XR:Z:GA XG:Z:GA
and read the bismark script to confirm.
But , I find the FLAG value of Read1 and Read2 swap round(the annotation in script ), It's confusing to me.
So I asked for help.
Thanks!

@FelixKrueger
Copy link
Owner

The whole strand and FLAG business is quite confusing, it also catches me out on many occasions! The reason for this is that the SAM format wasn't made to handle 4 strands of DNA, but assumes that there is one forward strand, and one reverse strand (which is the reverse complement).

So for non-directional FLAG values in Bismark wen made the decision many years ago that both the OT and CTOT strands should appear as forward reads in genome browsers (e.g. Seqmonk or IGV), and both OB and CTOB strands should appear as reverse reads. Before then we used to have different FLAG values, but I think we changed this some 6-8 years back:

--old_flag               Only in paired-end SAM mode, uses the FLAG values used by Bismark v0.8.2 and before. In addition,
                         this options appends /1 and /2 to the read IDs for reads 1 and 2 relative to the input file. Since
                         both the appended read IDs and custom FLAG values may cause problems with some downstream tools
                         such as Picard, new defaults were implemented as of version 0.8.3.


                                             default                         old_flag
                                       ===================              ===================
                                       Read 1       Read 2              Read 1       Read 2

                              OT:         99          147                  67          131

                              OB:         83          163                 115          179

                              CTOT:      147           99                  67          131

                              CTOB:      163           83                 115          179

I hope this is the information you were looking for?

@CristEin
Copy link
Author

CristEin commented Sep 2, 2021

thanks for your reply.
in my view , it can do judge this, just like I write before , and the code of bismark also confirms it :
截图_20213302033356

but I found that , it change the order of read1 and read2, the code is :
截图_20213602033601

for example, a read1 in fastq and bismark align(non_directional) , the FLAG of this read in sam file is 163, explained read2

of cause, there is annotation in script, but I don't know the reason
can explained this ? thanks again

@FelixKrueger
Copy link
Owner

FelixKrueger commented Sep 2, 2021

I think there were some tools that would complain if the old FLAGs were used, so we agreed to change them to the current state. This was pre-Github, but the conversation might still be somewhere on Seqanswers.com?

If we take a read pair for OT and CTOT as an example.

OT:

Read 1:

Read 1 is on the + strand and Read 2 is reversed (SAM FLAGS: 1 (read paired)+2 (proper pair)+32 (mate reverse strand)+64 (first in pair) = 99)

Read 2:

Read 2 is reverse complemented but informative for the OT (SAM FLAGS: 1+2+16 (read reverse strand)+128 (second in pair) = 147)

I think the FLAGs are accurate for OT.

CTOT:

Read 1:

Read 1 is reverse complemented (CTOT) and Read 2 is the forward read
but we swap round the FLAG for R1 and R2 so that we do not end up with discordant pairs
So Read 1 gets Read paired (1), read mapped in proper pair (2), read reverse strand and second in pair (1+2+16+128 = 147)

Read 1 aligns in reverse orientation against the converted top strand sequence. As above, this reverse orientation is not the same as the original reverse strand of the input the DNA, but is an artificial strand that only exists in bisulfite sequencing (where converted top and bottom strand sequences are no longer complementary (there are exceptions, of course)).

Since both OT and CTOT sequences are informative for the (original) top/forward strand of the input DNA, we made the decision swap the FLAGs of R1 and R2 round so that both OT and CTOT alignments look like they are top strand reads (facing in the forward direction).

Read 2:

Read 2 is on the + strand, and gets Read paired (1), read mapped in proper pair (2), mate reverse strand (32) and First in Pair (1+2+32+64)

Again, the read identity has been swapped from 2, to 1 (First in Pair).

I appreciate that this all quite confusing and arbitrary (which it actually is), but we had to settle for some sort of kludge to make the SAM format work in situations it is not designed for.

TL;DR:

  • OT and CTOT alignments both look like they are forward oriented
  • OB and CTOB alignments both look like they are reverse oriented
  • thus, if you want to work with the strand identity of non-directional data for downstream scripts, you need to use the XR/XG tags as the FLAG values are not unambiguous (as we have seen...)

@CristEin
Copy link
Author

CristEin commented Sep 2, 2021

Thanks,
1.it's different between the official website and your reply about the FLAG of CT/CTOT/OB/CTOB, which one is correct or newer ?
截图_20214402044411

2.if swap the FLAGs of R1 and R2 round, so we can't judge which one is Read1/Read2 by FLAG value, so how to judge it? by coordinate?

2.or how to judge CT/CTOT/CTOB/OB by FLAG, XR and XG?
thanks again

@FelixKrueger
Copy link
Owner

My apologies, now I gave you an explanation for the old FLAG definition... Argh, I have now tried to update the above comment to be correct.

As general rule, in the BAM output the reads which are read in from the Read 1 FastQ file are always reported on the first line, Read 2 FastQ file reads are always on line 2. So the reads and their FLAG values are reported like so:

OT:

Read 1         99       (first in pair)
Read 2        147       (second in pair)

CTOT:

Read 1        147        (second in pair)
Read 2         99        (first in pair)

So the table for --old_flag is correct when you look at the SAM FLAGs for which read identifies itself as first or second read, but for CTOT and CTOB this is the opposite to what you would find in the input FastQ files.

For all downstream tools in the Bismark package, I determine the strand identity using solely the read and genome conversion states. The FLAG values are really just cosmetic (for displaying in genome browsers) in this regard.

if ($read_conversion_1 eq 'CT' and $genome_conversion eq 'CT'){
     $index = 0; ## this is OT   (original top strand)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'GA'){
     $index = 1; ## this is CTOB (complementary to OB)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'CT'){
     $index = 2; ## this is CTOT (complementary to OT)
}
elsif ($read_conversion_1 eq 'CT' and $genome_conversion eq 'GA'){
     $index = 3; ## this is OB   (original bottom)
}	

I hope I got everything right this time...

@CristEin
Copy link
Author

CristEin commented Sep 2, 2021

Thanks,
OT:
Read 1 99 (first in pair) Read 2 147 (second in pair)
CTOT:
Read 1 147 (second in pair) Read 2 99 (first in pair)

1.I got this rule : in the BAM output the reads which are read in from the Read 1 FastQ file are always reported on the first line, Read 2 FastQ file reads are always on line 2, is this means that can't use samtools sort -n to sort file by query name? because read with the same name will be ordered according to the values of the READ1 and READ2 flags. if do that , will mixed the strand information.

  1. the $read_conversion_1 is first line of mate-pair read in bam file ?

thanks again

@FelixKrueger
Copy link
Owner

That might be right. The vanilla Bismark output will follow these rules, and thus allow discrimination of the strand identity:

OT

Read 1	99 (1st)	XR:Z:CT	XG:Z:CT
Read 2	147 (2nd)	XR:Z:GA	XG:Z:CT

CTOT

Read 1	147 (2nd)	XR:Z:GA	XG:Z:CT
Read 2	99 (1st)	XR:Z:CT	XG:Z:CT

If you want to use tools downstream that mix up the ordering, you might struggle to get the order back. If you really have to do this for some reason you could post-process the output to add an additional optional strand tag to each alignment (or if you wanted I could add such a tag as an option to each BAM line?).

to 2:) yes indeed. The conversion of the first read in the BAM file will, together with the genome conversion, dictate the aligment strand.

@CristEin
Copy link
Author

CristEin commented Sep 2, 2021

Thanks Felix,
OT
Read 1 99 (1st) XR:Z:CT XG:Z:CT
Read 2 147 (2nd) XR:Z:GA XG:Z:CT

CTOT
Read 1 147 (2nd) XR:Z:GA XG:Z:CT
Read 2 99 (1st) XR:Z:CT XG:Z:CT

OB
Read 1 83(1st) XR:Z:CT XG:Z:GA
Read 2 163(2nd) XR:Z:GA XG:Z:GA

CTOB
Read 1 163(1st) XR:Z:GA XG:Z:GA
Read 2 83(2nd) XR:Z:CT XG:Z:GA

1.maybe can add 1/2 after quryname, or add new tag to reserve it and so on
2.hope you can correct the official website about the FLAG of CT/CTOT/OB/CTOB

I've seen the brilliant code , It's very helpful for students like me, and I will study it carefully.
Thank you very much for your careful and detailed reply

Best wishes

@FelixKrueger
Copy link
Owner

Regarding 1):

I don't think adding 1/2 at the end of the queryname is a good idea, as the SAM format requires the QNAME to be the same for several reads of the same source (i.e. paired-end reads). Instead, I would suggest an optional tag, e.g. YS:Z:ALIGNMENT_STRAND, which would enable you to sort the reads by coordinate and back to queryname to your heart's content while still being able to know what is going on.

Regarding 2):

Can you let me know the location of 'the official website' you think needs correcting?

Cheers, Felix

@CristEin
Copy link
Author

CristEin commented Sep 2, 2021

Thanks,
1.it's pretty good to add an optional tag, because not everyone can judge by the combination of Order of read1/read2、 XR and XG

2.the official website:
github:
https://github.com/FelixKrueger/Bismark/tree/master/Docs

babraham bioinformatics(maybe older):
https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html

Excellent job!

Best wishes

@FelixKrueger
Copy link
Owner

I have now tried to add a new option --strandID, can you please clone the latest dev version, and see if it works for you? It should produce additional optional tags such as YS:Z:CTOB, YS:Z:OT, etc. If it is useful for you and you are happy with it, I'll add it in more permanently.

@CristEin
Copy link
Author

CristEin commented Sep 3, 2021

Yep, it works well.

the default status of strandID is ON and now it's easy for me to do downstream analysis.
but another question, whether the bismark toolkit , such as bismark_methylation_extractor, use this strandID to judge strand and call methyl? I mean that if one use bimark to align and use samtools to filter read or sort it etc. next use bismark_methylation_extractor to call methyl, It may produce the wrong result

Best wishes

@FelixKrueger
Copy link
Owner

Alright, it is now optional as intended.

And you are right, other downstream scripts like deduplicate_bismark, bismark_methylation_extractor, and probably a few others, do not make use of this new strandID, but instead derive the information themselves. I am afraid if someone deliberately wants to introduce several additional steps that interfere with the canonical Bismark workflow they need to be willing to put in a little extra work...

And in all fairness, you would need to run non-directional, paired-end alignments, then probably sort by coordinates, index, filter, re-sort by query name, and then try to feed the resulting files back into the downstream scripts to run into this issue (which explains that this has never come up in some eleven or so years of Bismark :) )... If you really needed to be doing this I would suggest you keep a record of the filtered readIDs, and then sub-set the original BAM file to keep only the reads you want - in the form that Bismark downstream tools would happily accept. Or probably even better, just:

  1. run Bismark > Deduplicate > MethylXtract
  2. in parallel, run your sorting, filtering, re-sorting, and keep a record of read IDs you want to keep
  3. subset the C* methylation calls and keep only calls of readIDs you want to keep (should be a 10 line script or so...)

I hope this is acceptable?

@CristEin
Copy link
Author

CristEin commented Sep 5, 2021

Yep, I’m truly grateful for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants