Reads Can Map Entirely Beyond Reference Sequence End #48

Open
DarioS opened this Issue Apr 24, 2017 · 7 comments

DarioS commented Apr 24, 2017

I have a set of thousands of reference sequences with high sequence similarity (i.e. alleles of HLA genes). I notice that bowtie sometimes maps reads beyond the ends of a small number of reference sequences. If I make a minimal example with only one reference sequence and one pair of reads that was mapped beyond the boundary of that particular reference sequence, then bowtie no longer aligns the read pair to that reference sequence; the pair is reported as unaligned. I used the command bowtie -v 0 -a -S indexes/IMGT-HLA/hla -1 R1.fq -2 R2.fq  test.sam. I used version 1.2, downloaded from the website as the pre-compiled Linux build.

Owner

BenLangmead commented Apr 24, 2017

That's strange. We have to be able to reproduce before we can investigate, though. I know your attempt to find a minimal example wasn't successful, but can you share any example (reads+reference) where this happens?

Owner

BenLangmead commented Apr 25, 2017

Is it possible you have multiple sequences with the same name, and that that is confusing IGV?

DarioS commented Apr 26, 2017

No, it's recorded like that in the SAM file. What is strange about the FASTA reference file is that it contains entries with different names but identical nucleotide sequences. The alleles have only the protein-coding region (CDS) sequence recorded, but the difference between two alleles may biologically occur in the UTRs. So, they end up with different allele IDs but identical nucleotide sequences.

Could bowtie-build check for this before indexing proceeds, rather than silently producing odd mappings that the end user gets no warning about? It has to read in all of the references anyway, so checking that all of the nucleotide sequences are unique would be a quick calculation.
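In the meantime, a check like this can be run on the FASTA file before indexing. The following is only a minimal sketch, not part of bowtie-build: the function names parse_fasta and duplicate_sequence_groups are hypothetical, and it assumes a plain-text FASTA where the record name is the first whitespace-delimited token of the header line and case differences are not meaningful.

```python
from collections import defaultdict


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the record name is the first token of the header.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def duplicate_sequence_groups(lines):
    """Return lists of record names that share an identical sequence."""
    groups = defaultdict(list)
    for name, seq in parse_fasta(lines):
        # Compare case-insensitively so soft-masked copies still match.
        groups[seq.upper()].append(name)
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    import sys

    # Usage: python find_duplicates.py hla_nuc.fasta
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for names in duplicate_sequence_groups(handle):
                print("\t".join(names))
```

Each printed line lists the names of records whose sequences are identical, so an empty output would mean every reference sequence is unique.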

Owner

BenLangmead commented Apr 26, 2017

Can you share any example (reads+reference) where this happens?

DarioS commented Apr 27, 2017

Yes.

Step 1: Download the reference file from ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_nuc.fasta and generate a Bowtie index for it.
Step 2: Download the test FASTQ file (10 MB).
Step 3: Map using the command bowtie -v 0 -a -S indexes/bowtie/hla test.fastq test.sam
Step 4: Consider an allele like HLA:HLA01298. Bowtie reports some alignments entirely beyond its end, and the alignments within the boundaries of the reference sequence consist almost entirely of mismatches (when IGV is set to Show Mismatched Bases). Most other alleles look perfectly fine. Needleman-Wunsch alignment of some of these almost-all-mismatch reads shows a 100% match to the same allele, but at a slightly shifted location, so Bowtie is reporting incorrect start and end positions for reads on this allele.
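To list the affected alignments without opening IGV, the SAM file from Step 3 can be scanned directly. This is a rough sketch under a few assumptions: the function name alignments_past_reference_end is hypothetical, the SAM file carries @SQ header lines with SN and LN tags (as output written with -S normally does), and the alignment end is derived from the 1-based POS plus the reference-consuming CIGAR operations (M, D, N, =, X) as defined in the SAM specification.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference (SAM specification).
REF_CONSUMING = set("MDN=X")


def alignments_past_reference_end(sam_lines):
    """Return (qname, rname, pos, end, ref_length) for alignments whose
    last aligned base lies beyond the end of the reference sequence."""
    ref_len = {}
    offenders = []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            if line.startswith("@SQ"):
                tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
                ref_len[tags["SN"]] = int(tags["LN"])
            continue
        fields = line.split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        # Skip unmapped reads (flag bit 0x4) and records without a CIGAR.
        if flag & 0x4 or rname == "*" or cigar == "*":
            continue
        span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
        end = pos + span - 1  # 1-based, inclusive
        if rname in ref_len and end > ref_len[rname]:
            offenders.append((qname, rname, pos, end, ref_len[rname]))
    return offenders


if __name__ == "__main__":
    import sys

    # Usage: python check_sam.py test.sam
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for record in alignments_past_reference_end(handle):
                print("\t".join(str(value) for value in record))
```

Any record it prints has an end coordinate larger than the LN value of its reference, which is the situation described for HLA:HLA01298.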

BenLangmead assigned BenLangmead and ch4rr0 and unassigned BenLangmead on May 2, 2017

Collaborator

ch4rr0 commented Jun 5, 2017

Hello, I am looking into this issue. The link provided takes me to a website that requires registration. Would it be possible to share the FASTQ file on Dropbox?

DarioS commented Jun 6, 2017

I changed the permissions to make it publicly accessible. Please click the link again.
