Bcbio Vardict 1.4.5 Exception in thread "main" java.lang.StringIndexOutOfBoundsException: #34

Closed
mortunco opened this issue Apr 4, 2016 · 9 comments

Comments

@mortunco

mortunco commented Apr 4, 2016

Hi,

I am trying to call somatic mutations using VarDict with the bcbio pipeline, and I came across this error during the process. Even though it called mutations for chromosomes 1-15, I don't know what might have caused the error. I also checked the closed issues but couldn't find an answer; they only concluded that a new version had been released.

Thank you for your help,

Best,

Tunc.

My VarDict versions are:

vardict,2016.02.19
vardict-java,1.4.5
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.charAt(String.java:658)
    at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1896)
    at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2552)
    at com.astrazeneca.vardict.VarDict.somaticNotParallel(VarDict.java:354)
    at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:249)
    at com.astrazeneca.vardict.VarDict.start(VarDict.java:65)
    at com.astrazeneca.vardict.Main.run(Main.java:134)
    at com.astrazeneca.vardict.Main.main(Main.java:25)
' returned non-zero exit status 1

@mjafin
Contributor

mjafin commented Apr 4, 2016

Related bcbio/bcbio-nextgen#1300

@tfenne
Contributor

tfenne commented Apr 6, 2016

I'm seeing the same error running VarDict 1.4.5. I modified the VarDict source to spit out the read name and cigar string of the read it was processing when the exception occurred, and in my case the read's cigar started with an insertion operator (e.g. 2I98M). That's technically valid SAM - it's just unusual, and I suspect that's what's causing the problem.

My solution is to avoid running the GATK Indel Realigner on data I'm feeding to VarDict, but it would be nice if VarDict handled this case, either by filtering those reads, turning the leading I operator into a soft-clip (S) or something else.
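To illustrate the two options mentioned above (filtering such reads, or turning the leading I into a soft clip), here is a minimal Python sketch that works on CIGAR strings only. It is not VarDict's code and not the fix that was eventually proposed; the function names are hypothetical. It relies on the fact that both I and S consume read bases but no reference bases, so rewriting a leading I as S leaves the alignment start position unchanged.

```python
import re

# One CIGAR operation: a length followed by an operator from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '2I98M' into [(2, 'I'), (98, 'M')]."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def starts_with_insertion(cigar):
    """True if the first operation after any leading clipping is an insertion."""
    ops = [op for _, op in parse_cigar(cigar) if op not in "SH"]
    return bool(ops) and ops[0] == "I"


def leading_insertion_to_softclip(cigar):
    """Rewrite leading insertions as soft clips, merging with an adjacent S.

    Hypothetical helper for illustration only; trailing insertions are left
    untouched here.
    """
    ops = parse_cigar(cigar)
    if not ops:  # e.g. '*' for an unavailable CIGAR
        return cigar
    out = []
    i = 0
    # Leading hard clips stay where they are.
    while i < len(ops) and ops[i][1] == "H":
        out.append(ops[i])
        i += 1
    # Fold any run of leading S and I operations into a single soft clip.
    clipped = 0
    while i < len(ops) and ops[i][1] in "SI":
        clipped += ops[i][0]
        i += 1
    if clipped:
        out.append((clipped, "S"))
    out.extend(ops[i:])
    return "".join(f"{length}{op}" for length, op in out)
```

With this sketch, `starts_with_insertion("2I98M")` is true and `leading_insertion_to_softclip("2I98M")` gives `2S98M`; reads could either be skipped on the first test or rewritten with the second before variant calling.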

@chapmanb
Contributor

chapmanb commented Apr 7, 2016

Tim -- thanks so much for identifying the underlying issue. Would it be possible to generate a minimal test case with a problem record? Miika and Zhongwu, it would be great to handle this in VarDict as well. Not using realigner is a workaround but frustrating to end users and not ideal as some people just like to realign. If it's a hard problem to fix we can re-evaluate a workaround in bcbio. Thanks so much.

@mortunco
Author

mortunco commented Apr 17, 2016

Dear @mjafin;

I tried to run what you asked with the following configuration file. Apparently there is a problem with bwa right now, because I couldn't even get as far as starting VarDict.

Should we ask @chapmanb about the BWA error?

Thanks, all, for your time and answers,

Best,

Tunc.

Configuration file

[ec2-user@ip-172-31-57-166 config]$ cat real_deneme_2_vardictonly.yaml 
details:
- algorithm:
    aligner: bwa
    mark_duplicates: false
    realign: false
    recalibrate: false
    variantcaller: [vardict]
    indelcaller: #pindel
    ensemble:
        numpass: 2
  analysis: variant2
  description: denem1_normal
  files:
  - /home/ec2-user/puppey/icgc_data/try1/real_deneme/input/normal.bam
  genome_build: GRCh37
  metadata:
    batch: syn3
    phenotype: normal
- algorithm:
    aligner: bwa
    mark_duplicates: false
    realign: false
    recalibrate: false
    variantcaller: [vardict]
    indelcaller: #pindel
    ensemble:
        numpass: 2
  analysis: variant2
  description: deneme1_tumor
  files:
  - /home/ec2-user/puppey/icgc_data/try1/real_deneme/input/tumor.bam
  genome_build: GRCh37
  metadata:
    batch: syn3
    phenotype: tumor
fc_date: '2016-03-24'
fc_name: real_deneme
upload:
  dir: ../final

bcbio-debug.log


[ec2-user@ip-172-31-57-166 log]$ tail -n 100 bcbio-nextgen.log 
[2016-04-06T07:55Z] Timing: organize samples
[2016-04-06T07:55Z] multiprocessing: organize_samples
[2016-04-06T07:55Z] Using input YAML configuration: /home/ec2-user/puppey/icgc_data/try1/real_deneme/config/real_deneme_2_vardictonly.yaml
[2016-04-06T07:55Z] Checking sample YAML configuration: /home/ec2-user/puppey/icgc_data/try1/real_deneme/config/real_deneme_2_vardictonly.yaml
[2016-04-06T07:55Z] Testing minimum versions of installed programs
[2016-04-06T07:55Z] Timing: alignment preparation
[2016-04-06T07:55Z] multiprocessing: prep_align_inputs
[2016-04-06T15:21Z] multiprocessing: disambiguate_split
[2016-04-06T15:21Z] Timing: alignment
[2016-04-06T15:21Z] multiprocessing: process_alignment
[2016-04-06T15:21Z] Aligning lane 1_2016-03-24_real_deneme with bwa aligner
[2016-04-10T10:30Z] Aligning lane 2_2016-03-24_real_deneme with bwa aligner
[2016-04-16T00:58Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /usr/local/bin/sambamba index -t 4 /home/ec2-user/puppey/icgc_data/try1/real_deneme/work/align/deneme1_tumor/tx/tmpOC2ghi/2_2016-03-24_real_deneme-sort.bam
sambamba-index: Error reading BGZF block starting from offset 135039898169: stream error: not enough data in stream

To show that I am not running out of disk space:

[ec2-user@ip-172-31-57-166 real_deneme]$ df
Filesystem      1K-blocks      Used Available Use% Mounted on
/dev/xvda1      103079180  56922580  46056352  56% /
devtmpfs         15701052       100  15700952   1% /dev
tmpfs            15709624         0  15709624   0% /dev/shm
/dev/xvdg      1031992064 766774616 212772264  79% /home/ec2-user/puppey

@chapmanb
Contributor

Tunc;
Sorry about the running issues. This looks like something is wrong with this alignment output file from bcbio:

/home/ec2-user/puppey/icgc_data/try1/real_deneme/work/align/deneme1_tumor/2_2016-03-24_real_deneme-sort.bam

since indexing on it failed. It's tough to guess what happened with this information but the things I'd try to debug would be:

  1. Try re-running in place to see if it was a read issue and things progress cleanly.
  2. If that fails, try removing the problem BAM files and re-running to see if it was an issue generating it.
  3. If that fails, look through the output logs from the BAM generation re-run to see if you can spot any issues.

Hopefully it's a one-off filesystem error and re-running will fix it. Hope this helps.
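Before re-running, one quick and only partial check is whether the BAM file was cut short. A complete BGZF file ends with a fixed 28-byte empty block defined in the SAM/BAM specification, so a missing marker suggests truncation. The sketch below is an assumption-laden illustration, not what sambamba or bcbio do internally, and the function name is hypothetical; a file that passes can still be corrupt in the middle, as the error offset above may indicate.

```python
import os

# The 28-byte empty BGZF block that terminates a complete BAM file
# (EOF marker from the SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF EOF marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        # Read only the last 28 bytes; the file may be very large.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

If `has_bgzf_eof` returns False for the sorted BAM, the file was most likely not written completely, and removing it and re-running (step 2) is the sensible next move.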

@mortunco
Author

Hi Brad;

Thank you for the fast answer. I need to ask a couple of things to clarify:

1. Did you mean simply restarting the process? (I think this will continue where it stopped?)
2. Remove that BAM from the tmp directory and continue as in the first suggestion?

Thanks,

Tunc.

@chapmanb
Contributor

Tunc;
You're right on. For 1, just restart which continues where it left off. For 2, do rm -f /home/ec2-user/puppey/icgc_data/try1/real_deneme/work/align/deneme1_tumor/2_2016-03-24_real_deneme-sort.bam and re-start. Hope this helps.

@tfenne
Contributor

tfenne commented Sep 21, 2016

I just opened up PR #50 to fix this. I ran into it again when I started running a tool I wrote that clips primer sequences from reads. This led to leading/trailing insertion operators in the cigar (which I want to preserve for other tools), sometimes with soft or hard clipping outside of the insertion.

@mjafin
Contributor

mjafin commented Feb 15, 2017

I'm closing this issue now as PR #50 should be addressed shortly.

@mjafin mjafin closed this as completed Feb 15, 2017