ClinSV script stops at the first step # Create sample info file from bam files ... #23

Closed
jordimaggi opened this issue Mar 15, 2022 · 11 comments

@jordimaggi

Hi,

I am testing ClinSV on an Ubuntu 20.04 VM. I pulled the Docker image and tried to run the following command:

sudo docker run kccg/clinsv -r all -i $PWD/WGS/*.bam -ref $PWD/WGS/Reference_hg19/hg19.fa -p $PWD/test_run

The script seems to start correctly, but stops right away at the first task. This is the console output I get:

##############################################
####                ClinSV                ####
##############################################
# 15/03/2022 08:40:29

# clinsv dir: /app/clinsv
# projectDir: /media/analyst/Data/test_run
# sampleInfoFile: /media/analyst/Data/test_run/sampleInfo.txt 
# name stem: test_run
# lumpyBatchSize: 15
# genome reference: /media/analyst/Data/WGS/Reference_hg19/hg19.fa
# run steps: all
# number input bams: 1

# Create sample info file from bam files ...
ln -s  /media/analyst/Data/test_run/alignments//.bam

Any idea where the problem may lie?

Thanks for your help.

@halessi

halessi commented Mar 15, 2022

This is my exact problem as well; identical output using Singularity.

The cluster reports the job as having finished. PLEASE let's figure this out.

NOTE that if you try to run it again, it will work UNTIL a later step, when it looks for the BAM file that should have been linked into alignments.

I think it's something to do with the formatting of our BAM headers?

##############################################
####                ClinSV                ####
##############################################
# 15/03/2022 09:33:04

# clinsv dir: /opt/clinsv
# projectDir: /data/LAB_FOLDER/project_folder_using_separate_data_input
# sampleInfoFile: /data/LAB_FOLDER/project_folder_using_separate_data_input/sampleInfo.txt 
# name stem: project_folder_using_separate_data_input
# lumpyBatchSize: 15
# genome reference: /data/LAB_FOLDER/clinsv/refdata-b37
# run steps: all
# number input bams: 44

# Create sample info file from bam files ...
ln -s /vf/users/LAB_FOLDER/BAMs/bqsr-cleaned-SAMPLE.bam /data/LAB_FOLDER/project_folder_using_separate_data_input/alignments/SAMPLE/SAMPLE.bam

I went and tried running the ln -s command manually to see if it worked; the file was already linked, so it ran successfully and then just quit. I don't know what is going on.

@halessi

halessi commented Mar 15, 2022

@drmjc Any chance you have any insight on this? I think both of us are trying v1.0 (not GRCh38), but your input would be appreciated.

Thanks!!

@drmjc
Member

drmjc commented Mar 15, 2022 via email

@halessi

halessi commented Mar 15, 2022

Thank you for the reply.

This would make sense: if the BAM headers or something else are formatted differently with hg19, it would follow that ClinSV fails to link the files (assuming that data is needed, or that it ignores improperly formatted input).

So, in order to use hg19 I will need to wait for v1.1, is that correct?

Thanks again!

@drmjc
Member

drmjc commented Mar 15, 2022 via email

@halessi

halessi commented Mar 21, 2022

Update: I was able to fix the linking issue at the start of ClinSV by fixing my .bam.bai files. I had the .bam.bai files soft-linked to the .bai files, which ClinSV didn't like. By creating hard links from the .bai to the .bam.bai files, I was able to resolve this issue.
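
For reference, a minimal sketch of that workaround, assuming a hypothetical sample.bam whose index was written as sample.bai in the same directory (file names here are placeholders, not the ones from this run):

rm sample.bam.bai                  # remove the old soft link that ClinSV didn't like
ln sample.bai sample.bam.bai       # hard-link the existing .bai under the .bam.bai name
ls -li sample.bai sample.bam.bai   # both names should now share the same inode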

@drmjc
Member

drmjc commented Mar 22, 2022

How intriguing, thanks for the update.

@J-Bradlee, please note this; we should test with both index layouts:

  1. test.bam + test.bai
  2. test.bam + test.bam.bai

Both forms of naming the .bai index file are acceptable in practice (even though the SAM spec doesn't define this).
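
As a rough sketch of how the two index layouts could be generated for such a test (assuming a hypothetical test.bam and samtools on the PATH; samtools index accepts an optional output name as a second argument):

samtools index test.bam            # writes the index as test.bam.bai
samtools index test.bam test.bai   # writes the same index under the name test.bai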

@halessi

halessi commented Apr 10, 2022

@drmjc, just a quick question: does annotation often take upwards of four days? For 45 BAMs, my annotation phase has been running for 4.5 days at this point. Not sure if that's expected or not (200 GB RAM, 32 CPUs).

Thank you!

Hugh

@J-Bradlee
Collaborator

Hi @halessi, thought I would jump in here and say that for a single 72 GB BAM file it took at least 24 hours to run through all of ClinSV's steps on a machine with similar specs to yours. It also took around 6 hours to finish all the steps for a single 6 GB BAM file. Roughly what is the total size of all 45 of your BAM files?

@halessi

halessi commented Apr 11, 2022

@J-Bradlee Thanks so much for your reply.

I would guess the total size of all the BAM files is about 650 GB. Maybe this was too large a run? I would estimate the total running time for all steps, at this point, to be in the 10-day range, so perhaps I should have split this up more effectively...

Anyway, it sounds like this amount of time isn't crazy. But I'm a little worried it's going to take something like 20 days at this point...

Can you speak a bit more about the distribution of time? I.e., for your 72 GB BAM run, was the majority of it spent in Lumpy/CNVnator?

Note that I originally gave ClinSV even more resources (64 CPUs and, I think, 400 GB of RAM?), but the job was killed due to a cluster error, and ClinSV didn't seem to be using anywhere near that much, so I cut it back when resuming the job.

Thank you!

Hugh

@J-Bradlee
Collaborator

J-Bradlee commented Apr 11, 2022

No problem @halessi.

Most of the time is spent on the bigwig step, followed by the annotation and then CNVnator steps. Below is my output from a successful run on a subsampled 6 GB BAM file. Hopefully it gives you a rough idea of how long it would take for your BAM files.

Note that this was run with ClinSV v1.0 and the b38 reference genome; however, I think the durations should be similar for v0.9 with the b37 reference genome.

##############################################
####                ClinSV                ####
##############################################
# 28/03/2022 18:25:00

# clinsv dir: /app/clinsv
# projectDir: /app/project_folder
# sampleInfoFile: /app/project_folder/sampleInfo.txt 
# name stem: project_folder
# lumpyBatchSize: 5
# genome reference: /app/ref-data/refdata-b38
# run steps: all
# number input bams: 1

# Create sample info file from bam files ...
ln -s /app/input/NA12878.grch38.subsampled.bam /app/project_folder/alignments/FR05812606/FR05812606.bam
ln -s /app/input/NA12878.grch38.subsampled.bam.bai /app/project_folder/alignments/FR05812606/FR05812606.bam.bai
# Read Sample Info from /app/project_folder/sampleInfo.txt
# use: FR05812606       H7LH3CCXX_6             /app/input/NA12878.grch38.subsampled.bam
# 1 samples to process
# If not, please exit make a copy of sampleInfo.txt, modify it and rerun with -s sampleInfo_mod.txt pointing to the new sample info file. 

###### Generate the commands and scripts ######

# bigwig

# lumpy

# cnvnator

# annotate

# prioritize

# qc

###### Run jobs ######

 ### executing: sh /app/project_folder/alignments/FR05812606/bw/sh/bigwig.createWigs.FR05812606.sh &> /app/project_folder/alignments/FR05812606/bw/sh/bigwig.createWigs.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 01:31:33
 ### exist status: 0

 ### executing: sh /app/project_folder/alignments/FR05812606/bw/sh/bigwig.q0.FR05812606.sh &> /app/project_folder/alignments/FR05812606/bw/sh/bigwig.q0.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 00:37:20
 ### exist status: 0

 ### executing: sh /app/project_folder/alignments/FR05812606/bw/sh/bigwig.q20.FR05812606.sh &> /app/project_folder/alignments/FR05812606/bw/sh/bigwig.q20.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 00:36:05
 ### exist status: 0

 ### executing: sh /app/project_folder/alignments/FR05812606/bw/sh/bigwig.mq.FR05812606.sh &> /app/project_folder/alignments/FR05812606/bw/sh/bigwig.mq.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 00:37:10
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/FR05812606/lumpy/sh/lumpy.preproc.FR05812606.sh &> /app/project_folder/SVs/FR05812606/lumpy/sh/lumpy.preproc.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 00:12:51
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.sh &> /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.e  ...  

 ### finished after (hh:mm:ss): 00:26:51
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/joined/lumpy/sh/lumpy.depth.joined.sh &> /app/project_folder/SVs/joined/lumpy/sh/lumpy.depth.joined.e  ...  

 ### finished after (hh:mm:ss): 00:00:54
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/FR05812606/cnvnator/sh/cnvnator.caller.FR05812606.sh &> /app/project_folder/SVs/FR05812606/cnvnator/sh/cnvnator.caller.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 00:56:31
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/joined/sh/annotate.main.joined.sh &> /app/project_folder/SVs/joined/sh/annotate.main.joined.e  ...  

 ### finished after (hh:mm:ss): 01:27:03
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/joined/sh/prioritize.main.joined.sh &> /app/project_folder/SVs/joined/sh/prioritize.main.joined.e  ...  

 ### finished after (hh:mm:ss): 00:00:07
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/qc/sh/qc.main.joined.sh &> /app/project_folder/SVs/qc/sh/qc.main.joined.e  ...  

 ### finished after (hh:mm:ss): 00:00:48
 ### exist status: 0

# 29/03/2022 00:52:13 Project project_folder project_folder | Total jobs 11 | Remaining jobs 0 | Remaining steps bigwig,lumpy,cnvnator,annotate,prioritize,qc  11 | Total time: 386 min

# 29/03/2022 00:52:13 Project project_folder project_folder | Total jobs 11 | Remaining jobs 0 | Remaining steps   0 | Total time: 386 min

# Everything done! Exit

# writing igv session files...

xml file: /app/project_folder/igv/FR05812606.xml

I also want to add that you may experience even slower times for the CNVnator section, as its job resources are hard-coded to 16 CPUs and 30 GB of memory (see the source code line here), so it is not using all the resources available to it.
