problem with sambamba #232

Closed
CapitanFlint opened this issue Jul 29, 2019 · 7 comments

@CapitanFlint

CapitanFlint commented Jul 29, 2019

Hi,

I've installed the new 2.5.0 version. Then I ran the .sh, but it told me that sambamba wasn't in my PATH.
So I installed sambamba via bioconda and made a new environment for it.
I ran the .sh again and got this:

lun jul 29 14:00:59 -04 2019 Running GRIDSS. The full log is in ./gridss.full.20190729_140059.PC-1.1006.log
lun jul 29 14:00:59 -04 2019 Start pre-processing /home/bromero/2019/data/bamfiles/clean_normalized_bam/costa/Ccost2-8_clean.bam
lun jul 29 14:00:59 -04 2019 CollectInsertSizeMetrics /home/bromero/2019/data/bamfiles/clean_normalized_bam/costa/Ccost2-8_clean.bam first 10000000 records
lun jul 29 14:02:50 -04 2019 CollectGridssMetricsAndExtractSVReads|sambamba /home/bromero/2019/data/bamfiles/clean_normalized_bam/costa/Ccost2-8_clean.bam

Then nothing more happened, it just stopped.
Apparently there is something wrong with sambamba, or I'm missing something.
Or maybe the anaconda version of sambamba is old and there is some incompatibility?

Thank you, D. Cameron, for answering my previous question. The new version is way better than the 2.1.0 that is stored in the anaconda cloud.

Greetings,

Bruno.

EDIT: I checked the bioconda sambamba version and it is old (0.6.6 vs 0.7.0). The problem is that installing sambamba myself is a hard task, so if there is a simpler solution than compiling sambamba from the source package, please let me know, assuming that sambamba is the problem.

@d-cameron
Member

d-cameron commented Jul 30, 2019

> jul 29 14:00:59 -04 2019 Running GRIDSS. The full log is in ./gridss.full.20190729_140059.PC-1.1006.log

Did it actually stop? If you've got a large BAM file then each step from that step onward will take hours. I've had feedback from other users about how spammy the full log file is so I opted for the new driver script to only output the very high-level progress. Do you see any progress if you tail -f ./gridss.full.20190729_140059.PC-1.1006.log? Is this new behaviour confusing, or just unexpected because you're used to the spammy output of gridss 2.1.0?
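
One way to answer the "did it actually stop?" question without watching the log by hand is to check whether the full log file is still growing. The helper below is a minimal sketch and not part of GRIDSS; the function name `log_is_growing` and the default 5-second interval are assumptions made for illustration.

```shell
# Succeed (exit status 0) if the file grew during the sampling interval.
# Usage: log_is_growing <log file> [seconds between samples]
log_is_growing() {
    log_file=$1
    interval=${2:-5}
    # Arithmetic expansion strips the padding some wc implementations add.
    before=$(( $(wc -c < "$log_file") ))
    sleep "$interval"
    after=$(( $(wc -c < "$log_file") ))
    [ "$after" -gt "$before" ]
}
```

For example, `log_is_growing ./gridss.full.20190729_140059.PC-1.1006.log 60 && echo "still writing"` samples the log one minute apart. Running `tail -f` on the same file shows the progress messages themselves.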

@d-cameron
Member

> Then nothing more happened, it just stopped.

Just to be clear: that is expected behaviour for the driver script. The status messages every 1,000,000 reads are now in the full log file. Do I need to add more text to the default output to make this clearer?

@CapitanFlint
Author

Thank you, Cameron, for your very quick response.
Two days ago, I deleted all gridss-related files, including the environment. Then I downloaded the new version again (just the jar-with-dependencies and gridss.sh files), created a new environment for gridss, and moved the files onto the server (important: I do not have sudo) into a single directory.
I modified the gridss.sh by adding these variables:

workingdir="/2019/data/bamfiles/prueba_20x/working_data"
reference="/2019/data/gen_ref/Cistanthe_genoma_v3.1a.fasta"
output_vcf="/2019/data/vcf_files/output_gridss/"
assembly="/2019/data/vcf_files/output_gridss/"
threads=$(nproc)
gridss_jar="~/programas/gridss/gridss-2.5.0-gridss-jar-with-dependencies.jar"
jvmheap="28g"
blacklist=""
metricsrecords=10000000
steps="all"
config_file=""
maxcoverage=50000
labels=""

I ran gridss again with this command:

$ bash gridss_20x.sh --reference ~/2019/data/gen_ref/Cistanthe_genoma_v3.1a.fasta --output ~/2019/data/vcf_files/output_gridss/Cist3-5_gridss.vcf --assembly ~/2019/data/vcf_files/output_gridss/Cist3-5_gridss.assembly.bam --threads 8 --jar ~/programas/gridss/gridss-2.5.0-gridss-jar-with-dependencies.jar ~/2019/data/bamfiles/prueba_20x/Cist3-5_clean.bam

It told me that 'sambamba is not in my PATH'. I supposed that sambamba must be inside the .jar file (all dependencies), so in my home directory I opened ./.bashrc with nano and added this as the last line:

export PATH="~/programas/gridss/gridss-2.5.0-gridss-jar-with-dependencies.jar:$PATH"

But it didn't work. I don't know what to do; my knowledge of bioinformatics is very limited.
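
A note on the PATH edit above: PATH is a colon-separated list of directories that the shell searches for executables, so an entry that points at a file such as a .jar does not make any new command available. The sketch below lists PATH entries that are not existing directories, which is one quick way to spot this kind of entry. The function name `path_non_dirs` is hypothetical and chosen only for this example.

```shell
# Print each entry of a colon-separated search path that is not an existing
# directory. Checks $PATH when no argument is given.
# The body runs in a subshell so the IFS and globbing changes stay local.
path_non_dirs() (
    IFS=:
    set -f
    for entry in ${1:-$PATH}; do
        # Skip empty entries; report anything that is not a directory.
        [ -z "$entry" ] || [ -d "$entry" ] || printf '%s\n' "$entry"
    done
)
```

Running `path_non_dirs` with no arguments after sourcing .bashrc would print the .jar entry, since it is a file rather than a directory.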

Given that, I installed sambamba (in the same environment) with:

conda install -c bioconda sambamba

Then I ran it again with the same command line, and it stopped making progress at the step where sambamba comes into the pipeline. That was two days ago and it is still at the same step, but every time I check 'htop', it is not using any memory or processing any data.
I checked for progress with tail -f in the working directory, but it told me that nothing was there.
Thank you again for everything, and sorry if this is too verbose. I really want to use gridss, because of this: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1720-5/tables/1

Greets,

Bruno

I ran all processes with 8 cores (2.22 GHz) for one sample, and the server has 250 GB RAM.

@d-cameron
Member

d-cameron commented Aug 2, 2019

> Given that, I installed sambamba (in the same environment) with:
> conda install -c bioconda sambamba

It appears that there's still something wrong with your environment. What happens when you run sambamba --version on the command line? Maybe your bioconda environment isn't loaded.

If your attempts to get sambamba running in your environment fail, then you're still able to run GRIDSS by calling it directly (it'll just run a bit slower).

The following command line should work for you:

java -ea -Xmx31g \
	-Dsamjdk.create_index=true \
	-Dsamjdk.use_async_io_read_samtools=true \
	-Dsamjdk.use_async_io_write_samtools=true \
	-Dsamjdk.use_async_io_write_tribble=true \
	-Dgridss.gridss.output_to_temp_file=true \
	-cp ~/programas/gridss/gridss-2.5.0-gridss-jar-with-dependencies.jar gridss.CallVariants \
	TMP_DIR=~/2019/data/vcf_files/output_gridss/ \
	WORKING_DIR=~/2019/data/vcf_files/output_gridss/ \
	REFERENCE_SEQUENCE=~/2019/data/gen_ref/Cistanthe_genoma_v3.1a.fasta \
	INPUT=~/2019/data/bamfiles/prueba_20x/Cist3-5_clean.bam \
	OUTPUT=~/2019/data/vcf_files/output_gridss/Cist3-5_gridss.vcf \
	ASSEMBLY=~/2019/data/vcf_files/output_gridss/Cist3-5_gridss.assembly.bam \
	WORKER_THREADS=16 \
	2>&1 | tee -a gridss.$HOSTNAME.$$.log

@d-cameron
Member

> I modified the gridss.sh by adding these variables:

With the latest gridss.sh, there's no need to modify the script itself; you can do it all from the command line by adding --jvmheap 28g.
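
Combined with the options already used earlier in this thread, the invocation might look like the sketch below. The `run_gridss` wrapper and the `DRY_RUN` variable are hypothetical and exist only so the command can be printed before it is run. The paths are the ones from this thread with `~` written as `$HOME`, and an unmodified gridss.sh is assumed to be in the current directory.

```shell
# Run gridss.sh with the JVM heap size passed on the command line.
# With DRY_RUN=1 the command is printed instead of executed.
run_gridss() {
    heap=$1
    set -- bash gridss.sh \
        --jvmheap "$heap" \
        --reference "$HOME/2019/data/gen_ref/Cistanthe_genoma_v3.1a.fasta" \
        --output "$HOME/2019/data/vcf_files/output_gridss/Cist3-5_gridss.vcf" \
        --assembly "$HOME/2019/data/vcf_files/output_gridss/Cist3-5_gridss.assembly.bam" \
        --threads 8 \
        --jar "$HOME/programas/gridss/gridss-2.5.0-gridss-jar-with-dependencies.jar" \
        "$HOME/2019/data/bamfiles/prueba_20x/Cist3-5_clean.bam"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Calling `DRY_RUN=1 run_gridss 28g` prints the full command for inspection, and `run_gridss 28g` runs it.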

@d-cameron
Member

d-cameron commented Aug 2, 2019

Calling GRIDSS directly still requires R and bwa to be on the PATH, but not sambamba.
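
A quick pre-flight check for these dependencies could be sketched as follows. The function name `missing_tools` is hypothetical, and the example assumes the executables are named `R` and `bwa` on your system.

```shell
# Report which of the named tools cannot be found on PATH.
# Returns 0 if all are present, 1 otherwise.
missing_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    if [ -n "$missing" ]; then
        echo "not on PATH:$missing" >&2
        return 1
    fi
}
```

For the direct java call, `missing_tools R bwa` would be enough. For the driver script, `missing_tools R bwa sambamba` also covers sambamba.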

@CapitanFlint
Author

Thank you so much, Cameron. Wish you the best. The command line worked nicely; then I used the variantannotation.R and got DELs, DUPs and so on.
I'm closing this.

Bruno

d-cameron pushed a commit that referenced this issue Aug 7, 2019
…sions of sambamba

Outputting version numbers of all software dependencies
Default JVM heap size reduced to 25g