Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chromosomal sequence could not be extracted error #68

Closed
MichaelMW opened this issue Oct 14, 2016 · 8 comments
Closed

chromosomal sequence could not be extracted error #68

MichaelMW opened this issue Oct 14, 2016 · 8 comments

Comments

@MichaelMW
Copy link

MichaelMW commented Oct 14, 2016

I'm using a customized reference sequence.

When running bismark, I'm getting tons of error like "Chromosomal sequence could not be extracted for ... BIA 1"

Is it because my reference file is too short?
One example entry looks like this:

BIA
TTGGAGTTGAGTGGGTTGATGTAGGGTGTGACTAGTCGCCTGAGTGCCCAAAAGACGCCCTCAAGGACAGGGAAGCAGCTGCGATTCACTCAGCGATCACCAGCGCCCAGAACACCGTTTTTCAGGGCCGCCCAAGCGGAGGAGCAACACGTGGCCCAGCAAAAGGAGCTGGGTCCCGTGCAGCACAGCAACAAGAAACTGGCCTAGGAAGACGCTGGAGCGTCTTTCTTTCGCACGGAATTCAAGAGGGAGGAGGAGTAGGAAGGATTAGGTG

I'm using the default bismark_genome_preparation and bismark

bismark_genome_preparation --path_to_bowtie ~/tools/bowtie2-2.2.9/ --verbose myRef

Attached is my reference, as well as the bismark log.

myRef.fasta.txt
1.log.txt

Thanks in advance!

Michael

@FelixKrueger
Copy link
Owner

FelixKrueger commented Oct 14, 2016

Hi Michael,

You normally only see this warning message very rarely whenever a sequence aligns right to the vert end (or start for reverse sequences) of a reference genome, and typically this is on the MT. The error arises because Bismark would like to extract 2 bp further downstream of the sequence to determine the cytosine context but this is obviously not possible when there are no further 2bp available. If you are using very short sequences, such as amplicons, you may get these sequences a lot more frequently.

As a quick fix you could pad the sequences in your genome file with 2 bases of your liking on either side, e.g.:
>BIA NNTTGGAGTTGAGTGGGTTGATGTAGGGTGTGACTAGTCGCCTGAGTGCCCAAAAGACGCCCTCAAGGACAGGGAAGCAGCTGCGATTCACTCAGCGATCACCAGCGCCCAGAACACCGTTTTTCAGGGCCGCCCAAGCGGAGGAGCAACACGTGGCCCAGCAAAAGGAGCTGGGTCCCGTGCAGCACAGCAACAAGAAACTGGCCTAGGAAGACGCTGGAGCGTCTTTCTTTCGCACGGAATTCAAGAGGGAGGAGGAGTAGGAAGGATTAGGTGNN

If you use NN you make sure that the sequence context of terminal Cs would be found as Unknown which does not obfuscate further downstream events. I hope this helps,
Best, Felix

@MichaelMW
Copy link
Author

Fixed! Thanks again!

@FelixKrueger
Copy link
Owner

Wow that was quick! Thanks, Felix

@sansense
Copy link

sansense commented Dec 22, 2016

Dear Felix,
Did you suggest **NN** to pad the sequences (vs. NN) or is it just a formatting issue of github? What is the meaning of '**' in the former case?

I am doing amplicon sequencing and facing the same problem. I tried padding NN, and it worked for some sequences but then it didn't for some other. After using "**NN**", it works for many of the other sequences, however, it still shows the error "chromosomal sequence could not be extracted error". Could you elaborate further on this? My specific Q: is this error specific to NOT getting the cytosine context at the end of reference, or can it also occur in other case like when the read only partially maps to reference (and so there is not enough useful sequence remaining)?

@FelixKrueger
Copy link
Owner

Hi sansense, the ** ** is indeed the markdown way of making things bold, but apparently this doesn't work in a code block. I have therefore edited it out of the previous comment.

And yes this should only appear for when sequences align to the very edges of chromosomes and/or scaffolds (as Bismark does not perform soft-clipping). When you say you have padded your sequences, did you do this on the start and end?

@sansense
Copy link

Yes, I did it at both start and the end, but still get the same error

@sansense
Copy link

Since Bismark doesnt do soft clipping, do you suggest to pad more than 2-bases? Optimally how many? My read length is 301

@FelixKrueger
Copy link
Owner

The padding at short sequences really only affects the context calling. If your sequences or reads are too long in general they should simply not align at all. Maybe you should go and check in a bit more detail for one or two sequences what is going on, Bismark should report the ID of the sequence so you can grep for the sequence in the FastQ file (e.g. using zcat file.fastq.gz | grep -A 3 readID > test.fastq), and then align this sequence on its own (or using Bowtie2 against the Bismark converted genome sequence). I hope this helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants