Very high memory usage when realigning #27

Closed
mbhall88 opened this issue May 7, 2024 · 5 comments

mbhall88 commented May 7, 2024

Using v2.5.1 with the --realign-truth --realign-query options, I am getting crazy high memory usage. This is a bacterial sample, so the genome is ~4MB and the VCF is of negligible size. So far, all my jobs have failed when requesting 64GB of memory on our cluster. This seems much too high, right?

TimD1 (Owner) commented May 7, 2024

Right now, the main limitation of vcfdist is memory usage. It isn't related to genome or VCF size, but scales quadratically with the size of the largest cluster. What is the largest variant in the VCF? If it's pretty large, I would recommend limiting the max size with -l 5000 or even -l 1000. If that doesn't work, could you send me the VCF and I can investigate?
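To answer the "largest variant" question quickly, something like the following could be used. It is only a rough sketch: the helper name `largest_allele_len` is made up for illustration, and it approximates variant size by the longest REF or ALT allele string, so symbolic alleles such as `<DEL>` would be measured by their text length rather than their true span.

```shell
# Hypothetical helper (not part of vcfdist): read uncompressed VCF text on
# stdin and print the length of the longest REF or ALT allele string.
largest_allele_len() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    {
      if (length($4) > max) max = length($4)       # REF allele (column 4)
      n = split($5, alts, ",")                     # ALT (column 5) may list several alleles
      for (i = 1; i <= n; i++)
        if (length(alts[i]) > max) max = length(alts[i])
    }
    END { print max + 0 }                          # prints 0 if there are no records
  '
}

# Usage (the file name is a placeholder):
# bgzip -dc query.vcf.gz | largest_allele_len
```

If the printed value is far above the limit passed to -l / --largest-variant, large variants are a plausible suspect; if it is small, as in this thread, the memory use likely comes from somewhere else.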

mbhall88 (Author) commented May 8, 2024

Currently I am using --largest-variant 50

Here is an example of a failing sample (to be honest, every sample is failing).

test_data.tar.gz

I ran vcfdist from the v2.5.1 docker image with the following command

echo "Calculated maximum QUAL score..." 1>&2
MAX_QUAL=$(bgzip -dc BPH2947__202310.50x.bcftools.filter.vcf.gz | grep -v '^#' | cut -f 6 | sort -gr | sed -n '1p')
echo "MAX_QUAL=$MAX_QUAL" 1>&2
echo "Running vcfdist..." 1>&2
vcfdist BPH2947__202310.50x.bcftools.filter.vcf.gz truth.vcf.gz mutreference.fna --largest-variant 50 \
  --credit-threshold 1.0 -d --realign-truth --realign-query -p BPH2947__202310/BPH2947__202310. \
  -b BPH2947__202310.bed -mx $MAX_QUAL

Let me know if you need anything else.

I should say that without the realign flags (and with v2.3.3), this used no more than 500MB of memory.

TimD1 (Owner) commented May 8, 2024

Thanks for sending over the test data! I was able to reproduce your error with large RAM usage, but found the culprit to actually be the optional --distance (-d) calculation.

I'm looking into it now.

TimD1 (Owner) commented May 8, 2024

This was caused by an indexing error in the --distance calculation, and should now be fixed in the new release: v2.5.2.

TimD1 closed this as completed May 8, 2024

mbhall88 (Author) commented May 9, 2024

Amazing! Thanks for the quick turnaround. Gotta love bug hunting (I do at least).
