Error: --score variant ID '2:168745247:A:AT' appears multiple times in main dataset. #74

Closed
asmaa-a-abdelwahab opened this issue Jan 16, 2023 · 6 comments
Labels
documentation Improvements or additions to documentation user-query User queries & requests

Comments

@asmaa-a-abdelwahab

I got this error in the PLINK2_SCORE process. Here is the command log:

PLINK v2.00a3.3LM 64-bit Intel (3 Jun 2022)    www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to Population_Report_ALL_additive_0.log.
Options in effect:
  --memory 204800
  --out Population_Report_ALL_additive_0
  --pfile vzs vcf_Population_Report_ALL
  --score Population-Report_ALL_additive_0.scorefile.gz zs header-read cols=+scoresums,+denom,-fid
  --score-col-nums 3-803
  --seed 31
  --threads 60

Start time: Thu Jan 12 11:55:05 2023
241801 MiB RAM detected; reserving 204800 MiB for main workspace.
Using up to 60 threads (change this with --threads).
399 samples (0 females, 0 males, 399 ambiguous; 399 founders) loaded from
vcf_Population_Report_ALL.psam.
57786076 variants loaded from vcf_Population_Report_ALL.pvar.zst.
Note: No phenotype data present.
Calculating allele frequencies... done.

Error: --score variant ID '2:168745247:A:AT' appears multiple times in main
dataset.
End time: Thu Jan 12 11:55:29 2023
@nebfield
Member

nebfield commented Jan 16, 2023

Hello,

Thanks for reporting the issue. It looks like when your VCF is converted into plink2 format, multiple variants have the same ID. This will cause a plink error when calculating a score.

One way to fix this is to edit the configuration file conf/modules.config. Change the following block:

    withName: PLINK2_VCF {
        ext.args = "--new-id-max-allele-len 100 missing"
    }

To:

    withName: PLINK2_VCF {
        ext.args = "--new-id-max-allele-len 100 missing --rm-dup force-first"
    }

This deduplicates your variants when the VCF is recoded as a plink2 file: with force-first, plink2 keeps only the first variant for each duplicated ID. This should fix your error, but you will lose some variants from the original VCF. You might want to check the plink documentation for other strategies.

plink2 -help  --rm-dup
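
Before choosing a deduplication strategy, it may help to see which IDs actually collide. The following is a minimal sketch, not part of pgsc_calc or plink2: `list_dup_ids` is a hypothetical helper that reads a decompressed .pvar on stdin, locates the ID column from the `#CHROM` header line, and prints each ID seen more than once. It assumes the .pvar was written by plink2 with a header and is tab-delimited.

```shell
# list_dup_ids: print every variant ID that occurs more than once in a
# tab-delimited .pvar read from stdin (hypothetical helper for illustration).
list_dup_ids() {
    awk -F'\t' '
        /^##/ { next }                      # skip meta-information lines
        /^#CHROM/ {                         # header line: find the ID column
            for (i = 1; i <= NF; i++) if ($i == "ID") id_col = i
            next
        }
        id_col == 0 {                       # no usable header seen: give up
            print "no ID column found in header" > "/dev/stderr"
            exit 1
        }
        ++seen[$id_col] == 2 { print $id_col }   # report an ID on its second sighting
    '
}

# Example usage (assumes zstd is installed; file name taken from the log above):
# zstd -dc vcf_Population_Report_ALL.pvar.zst | list_dup_ids | head
```

Each duplicated ID is printed once, however many times it repeats, so the output length gives a rough idea of how many IDs --rm-dup would have to resolve.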

Please let me know if you have any problems or questions.

@smlmbrt smlmbrt added the user-query User queries & requests label Jan 16, 2023
@asmaa-a-abdelwahab
Author

asmaa-a-abdelwahab commented Jan 17, 2023

@nebfield Thank you. The error is solved but I got another error:
Error: VCF file has a variant with 257 ALT alleles; this build of plink2 is limited to 254

@nebfield
Member

That's an error I've never seen before 😅 Some extra plink parameters may help:

    withName: PLINK2_VCF {
        ext.args = "--new-id-max-allele-len 100 missing --rm-dup force-first --max-alleles 254"
    }

I picked 254 based on your error message. If you want to remove multi-allelic variants you can set --max-alleles 2.

@asmaa-a-abdelwahab
Author

asmaa-a-abdelwahab commented Jan 18, 2023

@nebfield Thanks for your reply. I already tried that. The executed command already includes --max-alleles 2 by default, so when I add --max-alleles 254 as an extra parameter, I get an error that the flag is duplicated.

Command executed:

  plink2 \
      --threads 60 \
      --memory 204800 \
      --set-all-var-ids '@:#:$r:$a' \
      --max-alleles 2 \
      --new-id-max-allele-len 100 truncate --rm-dup force-first \
      --vcf Merged_Population.vcf.gz  \
      --make-pgen vzs\
      --out vcf_Population_Report_ALL # 'vcf_' prefix is important
  
  cat <<-END_VERSIONS > versions.yml
  "PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF":
      plink2: $(plink2 --version 2>&1 | sed 's/^PLINK v//; s/ 64.*$//' )
  END_VERSIONS

The error:

Command error:
  Error: Duplicate --max-alleles flag.

@nebfield
Member

Sorry, I forgot that flag was already included. It might be best to filter very complex multiallelic variants from your VCF with bcftools and use the filtered VCF as input to pgsc_calc, because plink is having trouble recoding your VCF into plink2 pfile format.
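
The exact filtering command was not posted in this thread. As a rough sketch of the idea, the hypothetical helper below drops VCF records with more ALT alleles than a chosen limit, using plain awk so it runs without extra tools. The default of 254 comes from the plink2 error message above, and the output file name is only an assumption for illustration. With bcftools installed, `bcftools view --max-alleles` should achieve a similar result (that option counts REF plus ALT alleles, so the matching limit would be 255).

```shell
# drop_complex_sites: copy a VCF from stdin to stdout, keeping header lines
# and only those records with at most MAX_ALT ALT alleles (default 254).
# Hypothetical helper, not part of pgsc_calc, plink2, or bcftools.
drop_complex_sites() {
    max_alt="${1:-254}"
    awk -F'\t' -v max_alt="$max_alt" '
        /^#/ { print; next }                 # pass all header lines through
        { n = split($5, alts, ",") }         # column 5 is ALT, comma-separated
        n <= max_alt { print }               # keep records within the limit
    '
}

# Example usage (assumes gzip and bgzip are available; filtered.vcf.gz is a
# made-up output name):
# gzip -dc Merged_Population.vcf.gz | drop_complex_sites 254 | bgzip > filtered.vcf.gz
#
# Assumed bcftools equivalent (255 = 1 REF + 254 ALT):
# bcftools view --max-alleles 255 -Oz -o filtered.vcf.gz Merged_Population.vcf.gz
```

Passing a smaller limit, such as 1, would keep only biallelic sites, which matches the --max-alleles 2 setting that the pipeline already applies.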

@asmaa-a-abdelwahab
Author

@nebfield Hello,
Sorry for the late reply, and thanks for your recommendation. It worked for me.

@nebfield nebfield added the documentation Improvements or additions to documentation label Jan 23, 2023
@smlmbrt smlmbrt closed this as completed Jan 23, 2023