Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

individual vcfs merging failing #39

Closed
ndussex opened this issue Feb 19, 2023 · 14 comments · Fixed by #41 or #42
Closed

individual vcfs merging failing #39

ndussex opened this issue Feb 19, 2023 · 14 comments · Fixed by #41 or #42
Assignees
Labels
bug Something isn't working documentation Improvements or additions to documentation

Comments

@ndussex
Copy link

ndussex commented Feb 19, 2023

Hi,

I am trying to run step 9 ( 9_merge_vcfs.smk) with the latest version of the pipeline but the first job of this rule (i.e. bcftools merge) always fails without any indication of the possible error in the slurm.out or log files. The first merged.bcf is then empty, only containing the header. It could be a memory issue (68 mammalian genomes), but I am not sure. If I run the rule manually outside the pipeline, the same first step fails as well.

Previously, I would run this rule manually with a slightly different script, but since I want to estimate load with GERP scores, I want to merge the vcfs inside the pipeline.

My guess is that the '-m snps' option is causing this problem in the bcftools merge step:

bcftools merge -m snps -O b -o {output.merged} {input.bcf} 2> {log}

I didn't use it in my manual merging script since the selection of snps only is used in the following step:

bcftools view -m2 -M2 -v snps -Ob -o {output.bcf} {input.bcf} 2> {log} &&

But maybe I am wrong.

Thanks for your help!

@verku verku self-assigned this Feb 20, 2023
@verku verku added the bug Something isn't working label Feb 20, 2023
@verku
Copy link
Collaborator

verku commented Feb 20, 2023

Hi Nic! Thanks for reporting this bug. For me this step works but I've only run the pipeline with a few samples at a time. I'll create a new branch to try out your suggestion, could you test it with your dataset?

@verku
Copy link
Collaborator

verku commented Feb 20, 2023

One more question before I change the code: I guess you have tried to increase the memory for this rule, ideally to the maximum? If I remove the "-m snps" flag, the output file will be very large so I'd like to avoid removing it (without it the file will contain all sites...).

@verku verku linked a pull request Feb 21, 2023 that will close this issue
@ndussex
Copy link
Author

ndussex commented Feb 24, 2023 via email

@ndussex
Copy link
Author

ndussex commented Feb 25, 2023 via email

@verku
Copy link
Collaborator

verku commented Feb 27, 2023

Hi Nic! Thanks for looking into that, I'll try to re-create the error with my testdata and will get back to you.

@verku
Copy link
Collaborator

verku commented Feb 27, 2023

Hi! I might have an idea: did you specify f_missing in the config.yaml file (or F_MISSING in your manual script) as 0 or as 0.0? In bcftools, the F_MISSING parameter has to be a floating point number as it specifies the fraction of allowed missing genotypes. I didn't clarify this in the config.yaml file properly, which I'll fix in the next version.

@verku verku added the documentation Improvements or additions to documentation label Feb 27, 2023
@ndussex
Copy link
Author

ndussex commented Feb 27, 2023 via email

@verku
Copy link
Collaborator

verku commented Feb 27, 2023

Thank you for double checking!

@verku
Copy link
Collaborator

verku commented Feb 27, 2023

Hi! I know what went wrong, 'F_MISSING < 0.0' can't work - it must be 'F_MISSING = 0.0'. I'm working on a solution and will create a new pipeline version as soon as I've tested it thoroughly. I'll let you know when the pipeline is functioning and available!

@verku
Copy link
Collaborator

verku commented Feb 28, 2023

Hi Nic! There is a solution now available on the branch dev that I tested with the test dataset. If you have the time, would you like to test it on your data? Here is the code if you would like to run it in a separate script:

bcftools view -i 'F_MISSING = 0.0' -Oz -o {output.vcf} {input.bcf} 2> {log}

Otherwise, you can run git checkout dev to switch to the branch dev with the new solution and run it in the pipeline.

@ndussex
Copy link
Author

ndussex commented Feb 28, 2023 via email

@verku
Copy link
Collaborator

verku commented Feb 28, 2023

Awesome, thanks!

To clarify, the updated code checks the value of f_missing and runs bcftools in different ways for f_missing = 0.0, f_missing = 1.0, or a value 0.0 < f_missing < 1.0:

https://github.com/NBISweden/GenErode/pull/42/files?file-filters%5B%5D=.smk&show-viewed-files=true

# only include sites with zero missing data
 if [[ `echo 0.0 {params.fmiss} | awk '{{print ($1 == $2)}}'` == 1 ]]
then
  bcftools view -i 'F_MISSING = {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}
# include all sites
elif [[ `echo 1.0 {params.fmiss} | awk '{{print ($1 == $2)}}'` == 1 ]]
then
  bcftools view -i 'F_MISSING <= {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}
# include sites with less than the fraction f_missing of missing data
elif [[ `echo 0.0 {params.fmiss} 1.0 | awk '{{print ($1 < $2 && $2 < $3)}}'` == 1 ]]
then 
  bcftools view -i 'F_MISSING < {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}
fi
        
bcftools index -f {output.vcf} 2>> {log}

@ndussex
Copy link
Author

ndussex commented Mar 2, 2023 via email

@verku
Copy link
Collaborator

verku commented Mar 6, 2023

Thank you for testing the code! I'll just run some final quick checks and will release a new version with the corrected code.

@verku verku closed this as completed in #42 Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation
Projects
None yet
2 participants