Incorrect results when using large index #35

Open
mluypaert opened this Issue Nov 2, 2016 · 7 comments


I found that when mapping a pair of reads with bowtie 1 (tested with 1.1.1 and 1.1.2), the results are incorrect when using a large index (a genuinely large index built from a large reference, not a large-format index forced from a small fasta).

The command used here was:

bowtie-1.1.1/bowtie --tryhard -a --quiet -X 2000 -v 3 --12 /var/tmp/job_o45b8iO6CN_bowtieIn.tmp Ensembl/Homo_sapiens/release-84/cdna_unspliced/lib > bowtie_o45b8iO6CN_complete_largeidx_output.txt

To prove that the large index was the problem, I split the input fasta I used to generate the large index in two parts, and generated small indexes from each part. Then I ran the same bowtie analysis on each small-index part and concatenated the results.

The two-part small index analysis returned correct results (see bowtie_o45b8iO6CN_split_merged_output.txt), returning perfect matches on ENST00000551148, ENST00000549155, ENST00000546991, ENST00000392979 and ENST00000392977.
The large-index analysis, however, returned:

  • a perfect match for ENST00000551148 and ENST00000549155
  • a match with mismatches for ENST00000392977
  • no match for ENST00000546991 and ENST00000392979

This page shows that the read pair from the input file should match all five of the above-mentioned transcripts perfectly.

Can anyone please look into this?

P.S. I would attach the fasta file I used to generate this large index, but at 5.8 GB it is too big to upload to GitHub. If you want it, let me know and we'll work out how I can share it.
bowtie_o45b8iO6CN_complete_largeidx_output.txt
bowtie_o45b8iO6CN_split_merged_output.txt
job_o45b8iO6CN_bowtieIn.txt

Collaborator

ch4rr0 commented Nov 2, 2016

I would like to recreate this issue. It would also be helpful to have the fasta file and the command lines used to generate the small indexes.

Okay, this Dropbox link contains the complete fasta file I used for the large index (I will disable the link later on).

The file was split in two with this command: split -l 161150 -a 1 -d cdna_unspliced.fa splitfile_part
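One caveat worth noting (my own sketch, not part of the original report): split -l counts raw lines, so the cut only preserves whole FASTA records if the split point happens to land on a > header line. A small check along these lines can confirm that a line-count split falls on record boundaries; the function name and the toy data below are hypothetical:

```python
# Sanity-check that each chunk of a line-count split begins with a
# FASTA header line ('>'), i.e. no sequence record was cut in half.
def split_on_record_boundary(lines, chunk_size):
    """Return True if every chunk produced by splitting `lines`
    into blocks of `chunk_size` lines starts with a '>' header."""
    for start in range(0, len(lines), chunk_size):
        if not lines[start].startswith(">"):
            return False
    return True

# Tiny synthetic FASTA: two records of two lines each.
fasta = [">seq1", "ACGT", ">seq2", "TTGG"]
print(split_on_record_boundary(fasta, 2))   # True: chunks align with records
print(split_on_record_boundary(fasta, 3))   # False: second chunk starts mid-record
```

If a split part started mid-record, the small-index results would be wrong for a different reason than the large-index bug, so a check like this helps rule that out.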

And the bowtie indexes were all generated with the simplest version of the bowtie-build command:
For the large index:

bowtie-build cdna_unspliced.fa large_index

And for the small indexes:

bowtie-build splitfile_part0 small_index_part1
bowtie-build splitfile_part1 small_index_part2

@ch4rr0 Did you manage to recreate the issue?

Collaborator

ch4rr0 commented Nov 27, 2016

Hello @mluypaert,
My apologies for the delayed response.

Yes, I was able to reproduce the problem. Unfortunately, the work required to resolve it is nontrivial, so the fix will be delayed until the next bowtie release. We don't yet have a timeline for that release, but I will keep you posted as soon as one is decided.

Hello @ch4rr0

I see on the bowtie website that version 1.2.0 was released on 12/12/2016. However, I don't see anything in the changelog about a fix for the large-index bug (this issue). I assume it was not yet fixed in release 1.2.0? Any idea which release will contain the fix and when it is expected to go public?

@ch4rr0 @BenLangmead Can you please give me an update on this issue? Is there a timeline for when we could expect it to be fixed, and in what release?

Collaborator

ch4rr0 commented Jun 2, 2017

I am sorry for letting this one slip through the cracks. I have started looking into the issue and will reply to this thread as I progress.

ch4rr0 was assigned by BenLangmead Jul 6, 2017
