Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read marked as ambiguous but bowtie2 only returned a single hit #108

Closed
martinjvickers opened this issue Jun 1, 2017 · 15 comments
Closed

Comments

@martinjvickers
Copy link

martinjvickers commented Jun 1, 2017

TL:DR

I have the following read that bismark won't map against TAIR10 that I think should do;

@HS3:420:C3EHMACXX:8:1101:5227:2035 1:Y:0:ACCTCA
NAAAAAAAATGTTGTTGAATAATTTTTGAAAATTATTGGATATGATGTAATGTTTTGTGATTGAATTTTTTAAAATATATTAATAAAGAGTTTAGGATGT
+
#4=AAAAAAB3?A+@B):ABBBBBBBB0=ABBBBBBBB9*==?A4>)=>7>ABBBB26)=>A)=AAAAA?=@@@@;??>?>BBBB??#############

The reference is TAIR10 with ChrM and ChrC. To try to rule anything odd on my desktop and allow for it to be recreated I've just done this on an AWS Ubuntu instance and put the finer details here in this gist so it can be recreated;

https://gist.github.com/martinjvickers/b987d076aaad5f066b6afb5885a3bb8d

Bit more detail

If you run bowtie2 on this directional single end read (I know I'd have to convert the C's to T's but since there are none in this read I've not) you get a single hit from the CT-CT run and an unmapped result for the CT-GA run.

HS3:420:C3EHMACXX:8:1101:5227:2035	0	2_CT_converted	1518	1	100M	*	0	0	NAAAAAAAATGTTGTTGAATAATTTTTGAAAATTATTGGATATGATGTAATGTTTTGTGATTGAATTTTTTAAAATATATTAATAAAGAGTTTAGGATGT#4=AAAAAAB3?A+@B):ABBBBBBBB0=ABBBBBBBB9*==?A4>)=>7>ABBBB26)=>A)=AAAAA?=@@@@;??>?>BBBB??#############	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:0A99	YT:Z:UU
HS3:420:C3EHMACXX:8:1101:5227:2035	4	*	0	0	*	*	0	0	NAAAAAAAATGTTGTTGAATAATTTTTGAAAATTATTGGATATGATGTAATGTTTTGTGATTGAATTTTTTAAAATATATTAATAAAGAGTTTAGGATGT	#4=AAAAAAB3?A+@B):ABBBBBBBB0=ABBBBBBBB9*==?A4>)=>7>ABBBB26)=>A)=AAAAA?=@@@@;??>?>BBBB??#############	YT:Z:UU

So I'd have thought that this would be a hit.

I forked bismark and un-commented all of the print out statements in the check_bowtie_results_single_end_bowtie2 function (https://github.com/martinjvickers/Bismark) and reran it and it's being marked as ambiguous. Even more strangely there appears to be a third unmapped hit which bowtie2 never returned when run on individually;

HS3:420:C3EHMACXX:8:1101:5227:2035_1:Y:0:ACCTCA	256	*	0	0	*	*	0	0	NAAAAAAAATGTTGTTGAATAATTTTTGAAAATTATTGGATATGATGTAATGTTTTGTGATTGAATTTTTTAAAATATATTAATAAAGAGTTTAGGATGT#4=AAAAAAB3?A+@B):ABBBBBBBB0=ABBBBBBBB9*==?A4>)=>7>ABBBB26)=>A)=AAAAA?=@@@@;??>?>BBBB??#############

Is this a bug or am I misunderstanding something about how the reads are determined as ambiguous or how bowtie2 is used?

@FelixKrueger
Copy link
Owner

Hi Martin,

I have tried to recreate what you described on our cluster, and indeed the read is considered ambiguous here as well.

The important part in the alignment to the OT strand is: AS:i:-1 XS:i:-1, which means that the alignment has an alignment score of -1 (the 'N' in the read), but there is at least one more alignments that also has an alignment score of -1 (encoded in the XS flag). The FLAG value of 256 simply means 'not primary alignment' which is anything but the first alignment.

From the Bowtie2 manual:

 XS:i:<N>
Alignment score for the best-scoring alignment found other than the
alignment reported.  Can be negative.  Can be greater than 0 in [`--local`]
mode (but not in [`--end-to-end`] mode).  Only present if the SAM record is
for an aligned read and more than one alignment was found for the read.

To illustrate this a bit more I have then checked the Bowtie 2 results against the CT and GA genome indexes. While the read doesn't align to the GA converted genome at all, there are several possible alignments for the CT converted version:

@PG	ID:bowtie2	PN:bowtie2	VN:2.3.1	CL:"/bi/apps/bowtie2/2.3.1/bowtie2-align-s --wrapper basic-0 -t -x /bi/scratch/Genomes/Arabidopsis/TAIR10/Bisulfite_Genome/CT_conversion/BS_CT -a -U read1.fastq"
HS3:420:C3EHMACXX:8:1101:5227:2035	0	2_CT_converted	1518	1	100M	*	0	0	NAAAAAAAATGTTGTTGAATAATTTTTGAAAATTATTGGATATGATGTAATGTTTTGTGATTGAATTTTTTAAAATATATTAATAAAGAGTTTAGGATGT	#4=AAAAAAB3?A+@B):ABBBBBBBB0=ABBBBBBBB9*==?A4>)=>7>ABBBB26)=>A)=AAAAA?=@@@@;??>?>BBBB??#############	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:0A99	YT:Z:UU
HS3:420:C3EHMACXX:8:1101:5227:2035	256	3_CT_converted	14195485	1	100M	*	0	0	NAAAAAAAATGTTGTTGAATAATTTTTGAAAATTATTGGATATGATGTAATGTTTTGTGATTGAATTTTTTAAAATATATTAATAAAGAGTTTAGGATGT	#4=AAAAAAB3?A+@B):ABBBBBBBB0=ABBBBBBBB9*==?A4>)=>7>ABBBB26)=>A)=AAAAA?=@@@@;??>?>BBBB??#############	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:0A99	YT:Z:UU
HS3:420:C3EHMACXX:8:1101:5227:2035	256	1_CT_converted	12293213	1	100M	*	0	0	NAAAAAAAATGTTGTTGAATAATTTTTGAAAATTATTGGATATGATGTAATGTTTTGTGATTGAATTTTTTAAAATATATTAATAAAGAGTTTAGGATGT	#4=AAAAAAB3?A+@B):ABBBBBBBB0=ABBBBBBBB9*==?A4>)=>7>ABBBB26)=>A)=AAAAA?=@@@@;??>?>BBBB??#############	AS:i:-49	XS:i:-1	XN:i:0	XM:i:14	XO:i:0	XG:i:0	NM:i:14	MD:Z:0G8A0A0G4A10A2T4A2A0G25A1A28T2A0	YT:Z:UU
HS3:420:C3EHMACXX:8:1101:5227:2035	256	1_CT_converted	12293558	1	100M	*	0	0	NAAAAAAAATGTTGTTGAATAATTTTTGAAAATTATTGGATATGATGTAATGTTTTGTGATTGAATTTTTTAAAATATATTAATAAAGAGTTTAGGATGT	#4=AAAAAAB3?A+@B):ABBBBBBBB0=ABBBBBBBB9*==?A4>)=>7>ABBBB26)=>A)=AAAAA?=@@@@;??>?>BBBB??#############	AS:i:-50	XS:i:-1	XN:i:0	XM:i:14	XO:i:0	XG:i:0	NM:i:14	MD:Z:0G6G1A0A0G15T7A2A0G2A13A5A2A28A5	YT:Z:UU
HS3:420:C3EHMACXX:8:1101:5227:2035	256	1_CT_converted	12292868	1	11M1I88M	*	0	0	NAAAAAAAATGTTGTTGAATAATTTTTGAAAATTATTGGATATGATGTAATGTTTTGTGATTGAATTTTTTAAAATATATTAATAAAGAGTTTAGGATGT	#4=AAAAAAB3?A+@B):ABBBBBBBB0=ABBBBBBBB9*==?A4>)=>7>ABBBB26)=>A)=AAAAA?=@@@@;??>?>BBBB??#############	AS:i:-60	XS:i:-1	XN:i:0	XM:i:15	XO:i:1	XG:i:1	NM:i:16	MD:Z:0G5G5A13A3G4A1T0T1G9A13A4A25T1T0A0	YT:Z:UU

So as you can see the read aligns perfectly to chromosome 2 and 3, the other 3 alignments have quite poor alignment scores. In conclusion, to me it looks like the read is identified correctly as ambiguous and therefore not present in the final BAM file. I hope this clears things up a bit?

@martinjvickers
Copy link
Author

Hi Felix, thank you for your swift response.

Does this only apply to the CT_converted reference? Since I have the opposite case, where bismark maps the following read to the GA_converted reference (the CT is unmapped);

Read

@HS3:420:C3EHMACXX:8:1101:10176:30378 1:N:0:ACCTCA
ATAAGATCGTGTTGCGGTTTAAGTTTTTATATTTAATAATATATATGATATTAAGTTAAATTCGATTTTAAAATATTAATTAATTTTTTTTTTGTTTTTT
+
@CCFDFFFFFFHHGFGGGGHIJJIHIJJJJJJJJJJJJJJJJJJJJFIIJJJJJJJHIHGIJJCHHJJJJJIJJIDHHHHHGHFFFFFDDDDD5ADDDDD

Bismark result

HS3:420:C3EHMACXX:8:1101:10176:30378_1:N:0:ACCTCA	16	3	13591519	0	100M	*	0	0	AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCGAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCGCAACACGATCTTAT	DDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@	NM:i:26	MD:Z:0G1G2G3G2G2G0G2G0G3G1G4G0G1G6A1G4G2G4G1G3G3G1G5G15C7G1	XM:Z:h.h..h...h..h..hh..hh...h.h....hh.h..Z.....h....h..h....h.h.......h.h.....h..........Z......Z.....h.XR:Z:CT	XG:Z:GA

but when you look at the bowtie2 run;

Bowtie2

bowtie2 -q --score-min L,0,-0.2 --ignore-quals --nofw -x TAIR10/Bisulfite_Genome/GA_conversion/BS_GA -U test_7_CT.fastq
1 reads; of these:
  1 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    1 (100.00%) aligned >1 times
100.00% overall alignment rate
@HD	VN:1.0	SO:unsorted
@SQ	SN:1_GA_converted	LN:30427671
@SQ	SN:2_GA_converted	LN:19698289
@SQ	SN:3_GA_converted	LN:23459830
@SQ	SN:4_GA_converted	LN:18585056
@SQ	SN:5_GA_converted	LN:26975502
@SQ	SN:mitochondria_GA_converted	LN:366924
@SQ	SN:chloroplast_GA_converted	LN:154478
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0
HS3:420:C3EHMACXX:8:1101:10176:30378	16	3_GA_converted	13592231	0	100M	*	0	0	AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCAAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCACAACACAATCTTAT	DDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@	AS:i:-18	XS:i:-18	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:41A20A27C9	YT:Z:UU

I get a different returned hit (position 13592231 rather than 13591519 in the same Chromosome) with AS -18 XS 18. I can see why when I run bowtie -a, but that's not what bismark is running right? Isn't bismark calling bowtie2 with the following parameters (-q --score-min L,0,-0.2 --ignore-quals --nofw NB I'm just using the defaults at this point e.g. ./Bismark/bismark TAIR10 test.fastq)

Bowtie2 -a

bowtie2 -a -q --score-min L,0,-0.2 --ignore-quals --nofw -x TAIR10/Bisulfite_Genome/GA_conversion/BS_GA -U test_7_CT.fastq
1 reads; of these:
  1 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    1 (100.00%) aligned >1 times
100.00% overall alignment rate
@HD	VN:1.0	SO:unsorted
@SQ	SN:1_GA_converted	LN:30427671
@SQ	SN:2_GA_converted	LN:19698289
@SQ	SN:3_GA_converted	LN:23459830
@SQ	SN:4_GA_converted	LN:18585056
@SQ	SN:5_GA_converted	LN:26975502
@SQ	SN:mitochondria_GA_converted	LN:366924
@SQ	SN:chloroplast_GA_converted	LN:154478
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0
HS3:420:C3EHMACXX:8:1101:10176:30378	16	3_GA_converted	13592231	0	100M	*	0	0	AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCAAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCACAACACAATCTTAT	DDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@	AS:i:-18	XS:i:-18	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:41A20A27C9	YT:Z:UU
HS3:420:C3EHMACXX:8:1101:10176:30378	272	3_GA_converted	13591519	0	100M	*	0	0	AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCAAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCACAACACAATCTTAT	DDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@	AS:i:-18	XS:i:-18	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:41A20A27C9	YT:Z:UU
HS3:420:C3EHMACXX:8:1101:10176:30378	272	2_GA_converted	3626258	0	100M	*	0	0	AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCAAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCACAACACAATCTTAT	DDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@	AS:i:-18	XS:i:-18	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:41A20A27C9	YT:Z:UU
HS3:420:C3EHMACXX:8:1101:10176:30378	272	2_GA_converted	3615015	0	100M	*	0	0	AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCAAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCACAACACAATCTTAT	DDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@	AS:i:-18	XS:i:-18	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:41A20A27C9	YT:Z:UU
HS3:420:C3EHMACXX:8:1101:10176:30378	272	2_GA_converted	3618087	0	100M	*	0	0	AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCAAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCACAACACAATCTTAT	DDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@	AS:i:-18	XS:i:-18	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:41A20A27C9	YT:Z:UU

So in this case, wouldn't this be ambiguous?

@FelixKrueger
Copy link
Owner

Hi Martin,

I think you have found a special case here. The read does indeed show 5 alignments with the command you used even with the latest version of Bowtie2 v2.3.2. If you would use the Bowtie2 defaults (which are `--score-min L,-0.6,-0.6) you would find that the read does in fact yield 306 ambiguous alignments!

And I think exactly this is the problem here: the read aligns to so many different repetitive loci that it actually gives up trying any further, but in this case it didn't find a good second alignment yet. So the command:

ID:bowtie2      PN:bowtie2      VN:2.3.2        CL:"/bi/apps/bowtie2/2.3.2/bowtie2-align-s --wrapper basic-0 -q -t --score-min L,0,-0.2 --ignore-quals --nofw -x /bi/scratch/Genomes/Arabidopsis/TAIR10/Bisulfite_Genome/GA_conversion/BS_GA -U read2.fastq_C_to_T.fastq"
HS3:420:C3EHMACXX:8:1101:10176:30378_1:N:0:ACCTCA       16      3_GA_converted  13591519        0       100M    *       0       0       AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCAAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCACAACACAATCTTATDDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@     AS:i:-18        XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:41A20A27C9 YT:Z:UU

produces one alignment (AS:i:-18) but has not found a secondary alignment yet (hence no XS:i: field). As a result of this the alignment is considered unique.

If you increase the -D parameter which governs how hard Bowtie2 is going to look for valid alignments the situation looks much different:

PG     ID:bowtie2      PN:bowtie2      VN:2.3.2        CL:"/bi/apps/bowtie2/2.3.2/bowtie2-align-s --wrapper basic-0 -q -t -D 50 --score-min L,0,-0.2 --ignore-quals --nofw -x /bi/scratch/Genomes/Arabidopsis/TAIR10/Bisulfite_Genome/GA_conversion/BS_GA -U read2.fastq_C_to_T.fastq"
HS3:420:C3EHMACXX:8:1101:10176:30378_1:N:0:ACCTCA       16      3_GA_converted  13591519        0       100M    *       0       0       AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCAAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCACAACACAATCTTATDDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@     AS:i:-18        XS:i:-18        XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:41A20A27C9 YT:Z:UU

Here, both AS:i:-18 XS:i:-18 are reported, and the read is consequently also rejected by Bismark.

-D <int>                 Up to <int> consecutive seed extension attempts can "fail" before Bowtie 2 moves on, using
                         the alignments found so far. A seed extension "fails" if it does not yield a new best or a
                         new second-best alignment. Default: 15.

I suppose the design decision of the default Bowtie2 alignment mode is a trade-off between overall speed and sensitivity....

@martinjvickers
Copy link
Author

Hi Felix, thanks for taking the time to chat about it.

I'm confused how you're not getting the XS -18 in your example, if I run the same parameters as yours (with 2.3.2) I get;

bowtie2 2.3.2

$ ~/software/bowtie2-2.3.2/bowtie2 -q --score-min L,0,-0.2 --ignore-quals --nofw -x TAIR10/Bisulfite_Genome/GA_conversion/BS_GA -U ../test_CT.fastq 1 reads; of these:
  1 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    1 (100.00%) aligned >1 times
100.00% overall alignment rate
@HD	VN:1.0	SO:unsorted
@SQ	SN:1_GA_converted	LN:30427671
@SQ	SN:2_GA_converted	LN:19698289
@SQ	SN:3_GA_converted	LN:23459830
@SQ	SN:4_GA_converted	LN:18585056
@SQ	SN:5_GA_converted	LN:26975502
@SQ	SN:mitochondria_GA_converted	LN:366924
@SQ	SN:chloroplast_GA_converted	LN:154478
@PG	ID:bowtie2	PN:bowtie2	VN:2.3.2	CL:"/usr/users/JIC_c1/mvickers/software/bowtie2-2.3.2/bowtie2-align-s --wrapper basic-0 -q --score-min L,0,-0.2 --ignore-quals --nofw -x TAIR10/Bisulfite_Genome/GA_conversion/BS_GA -U ../test_CT.fastq"
HS3:420:C3EHMACXX:8:1101:10176:30378	16	2_GA_converted	3618087	0	100M	*	0	0	AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCAAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCACAACACAATCTTATDDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@	AS:i:-18	XS:i:-18	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:41A20A27C9	YT:Z:UU

I can understand why we get different alignments depending on which got returned first or different seed etc, but not why it's missing the XS flag.

The part that I'm more baffled about is when I run bismark on this same machine I don't get the XS flag. Surely these should be identical if I'm sat at the same machine with the same version of bowtie2.

Running bismark using bowtie2-2.3.2

$ ./Bismark/bismark --path_to_bowtie ~/software/bowtie2-2.3.2 TAIR10 test.fastq 

Now running 2 instances of Bowtie 2 against the bisulfite genome of TAIR10/ with the specified options: -q --score-min L,0,-0.2 --ignore-quals

Now starting the Bowtie 2 aligner for CTreadCTgenome (reading in sequences from test.fastq_C_to_T.fastq with options -q --score-min L,0,-0.2 --ignore-quals --norc)
Using Bowtie 2 index: TAIR10/Bisulfite_Genome/CT_conversion/BS_CT

1 reads; of these:
  1 (100.00%) were unpaired; of these:
    1 (100.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
0.00% overall alignment rate
Found first alignment:	HS3:420:C3EHMACXX:8:1101:10176:30378_1:N:0:ACCTCA	4	*	0	0	*	*	0	0	ATAAGATTGTGTTGTGGTTTAAGTTTTTATATTTAATAATATATATGATATTAAGTTAAATTTGATTTTAAAATATTAATTAATTTTTTTTTTGTTTTTT	@CCFDFFFFFFHHGFGGGGHIJJIHIJJJJJJJJJJJJJJJJJJJJFIIJJJJJJJHIHGIJJCHHJJJJJIJJIDHHHHHGHFFFFFDDDDD5ADDDDD	YT:Z:UU
Now starting the Bowtie 2 aligner for CTreadGAgenome (reading in sequences from test.fastq_C_to_T.fastq with options -q --score-min L,0,-0.2 --ignore-quals --nofw)
Using Bowtie 2 index: TAIR10/Bisulfite_Genome/GA_conversion/BS_GA

1 reads; of these:
  1 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    1 (100.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
100.00% overall alignment rate
Found first alignment:	HS3:420:C3EHMACXX:8:1101:10176:30378_1:N:0:ACCTCA	16	3_GA_converted	13591519	0	100M	*	0	0	AAAAAACAAAAAAAAAATTAATTAATATTTTAAAATCAAATTTAACTTAATATCATATATATTATTAAATATAAAAACTTAAACCACAACACAATCTTAT	DDDDDA5DDDDDFFFFFHGHHHHHDIJJIJJJJJHHCJJIGHIHJJJJJJJIIFJJJJJJJJJJJJJJJJJJJJIHIJJIHGGGGFGHHFFFFFFDFCC@	AS:i:-18	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:41A20A27C9	YT:Z:UU

When bismark says it is running with options -q --score-min L,0,-0.2 --ignore-quals --nofw, there aren't other flags set are there?

@FelixKrueger
Copy link
Owner

Hi Martin,

This is indeed a surprising effect, which I cannot reproduce on our cluster over here (I consistently get the same first alignment, I have tried some 30 times for both Bismark and Bowtie2 on its own).

In your case, Bowtie2 obviously picks a different alignment as the first random hit and the 15 subsequent tries depending on whether you launch it on its own or from within Bismark. I know that a number of factors play into the random and seeding heuristics of Bowtie2 such as the read position within the file, possibly some parameters specific to your operating system and platform, and sometimes even the read name! (I recall we had a conversion with Ben Langmead at some point that the same read would give an alignment under one condition but would not align if the read was pre-pended with an incrementing number!). How this works exactly is probably a question you would have to ask to the Bowtie2 developers directly I am afraid.

On a different note, I think your original question raises a very valid point: Is there a way we can discriminate between reads that have only a single unique alignment (and this no XS:i: tag), or reads that had many repetitive alignments (governed by the -D tries parameter), where none of the reads actually made it through to be considered a valid secondary alignment. I think I am going to post this as an issue over at the Bowtie2 repo.

@martinjvickers
Copy link
Author

Yeah I understand the difficulties in reproducing the results from these kinds of programs. I've seen the read position one too, where if you give a single read you get something like AS:i:0 XS:i:-14 but within the full fastq file you get AS:i:-14 AS:i:-14.

When seeing the behaviour we've been discussing above my thought was that bowtie2 called by bismark isn't just using those -q --score-min L,0,-0.2 --ignore-quals --nofw parameters and there was a flag set that I was missing and that running bowtie2 directly wasn't producing a like-for-like result.

@FelixKrueger
Copy link
Owner

No I think the command is using the exact same parameters in both circumstances. Just for the record I have raised an issue on the Bowtie2 project for this as well: BenLangmead/bowtie2#114

@martinjvickers
Copy link
Author

Thanks for raising the issue with Bowtie2.

FYI: I've been able to work out the differences between our bowtie2 and bismark runs. It's because you're using the CT converted result that bismark created with bowtie2 which renamed the read name from;

HS3:420:C3EHMACXX:8:1101:10176:30378 1:N:0:ACCTCA

to

HS3:420:C3EHMACXX:8:1101:10176:30378_1:N:0:ACCTCA

and I was using the original read but converting the C's to T's using sed so I never replaced the space with an underscore _. That explains why I wasn't able to get the same results with seemingly the same input.

@FelixKrueger
Copy link
Owner

Excellent, good we got to the bottom of this as well! It appears that Bowtie2 might get an option in the future that will help eliminating these repetitive corner cases in a future release, so yea - well spotted!

@martinjvickers
Copy link
Author

Me too, I honestly thought I was going mad! Yeah I read Ben's comment, looks promising. Glad I could help.

@martinjvickers
Copy link
Author

Hi @FelixKrueger

I have another odd example where bismark is marking a read as ambiguous but I'm not sure it should be. The raw read is this one and it's mapped against TAIR10.

@HS3:420:C3EHMACXX:8:1101:10017:64716
TTTTTTTATTATTTATCGATTAGATTTTGGAAAAGTTAGTAAATTATAAGTATTATTGTTTTTTTGATGGATATTTAGGATTTTTTTAGATTTCGATTTA
+
CCCFFFFFGHHHHJIIJFGIHIDGHIJJFFGHIIIJIJEIIJFIIIIIIAFHGIHIJFIIIIJHF?=A@ACCCAD@DA;ACCDDDDDD9@CDDD?@DDDD

I get one hit back from the CT converted reference with AS:i:-18 and XS:i:-18 and one hit back from the GA converted reference with no XS flag and AS:i:0

HS3:420:C3EHMACXX:8:1101:10017:64716	0	5	11198098	0	100M	*	0	0	TTTTTTTATTATTTATTGATTAGATTTTGGAAAAGTTAGTAAATTATAAGTATTATTGTTTTTTTGATGGATATTTAGGATTTTTTTAGATTTTGATTTA	CCCFFFFFGHHHHJIIJFGIHIDGHIJJFFGHIIIJIJEIIJFIIIIIIAFHGIHIJFIIIIJHF?=A@ACCCAD@DA;ACCDDDDDD9@CDDD?@DDDD	AS:i:-18XS:i:-18	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:25G5G1G66	YT:Z:UU
HS3:420:C3EHMACXX:8:1101:10017:64716	16	5	11363146	42	100M	*	0	0	TAAATCAAAATCTAAAAAAATCCTAAATATCCATCAAAAAAACAATAATACTTATAATTTACTAACTTTTCCAAAATCTAATCAATAAATAATAAAAAAA	DDDD@?DDDC@9DDDDDDCCA;AD@DACCCA@A=?FHJIIIIFJIHIGHFAIIIIIIFJIIEJIJIIIHGFFJJIHGDIHIGFJIIJHHHHGFFFFFCCC	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:100	YT:Z:UU

Wouldn't this mean that there is one good unique mapping result to the GA converted reference and several inferior ones to the CT converted reference rather than it being an ambiguous read?

@FelixKrueger
Copy link
Owner

Hi Martin, I'll try to take a look at this tomorrow. Thanks, Felix

@FelixKrueger
Copy link
Owner

Hi Martin,

Apologies for taking so long. I did now have time to look at this issue and you were indeed right that the second alignment should trump the first one. I have now changed the timing of when the ambiguous within same thread is reset, and now it appears to work fine. Please can you get the latest development version to see if it works on your end? Fixed in this commit: 40da666.

Many thanks for spotting this, Felix

@martinjvickers
Copy link
Author

Hi Felix,

That's okay, your commit fixed the issue perfectly. Thank you for looking at that. I was hoping to bump into you at yesterdays BBQ at Babraham to say hi in person but I didn't spot you.

Cheers, Martin

@FelixKrueger
Copy link
Owner

FelixKrueger commented Jun 27, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants