simplest vs bfile_path SAMPLESHEET #77

Closed
olatzu opened this issue Jan 20, 2023 · 5 comments
Labels
user-query User queries & requests

Comments


olatzu commented Jan 20, 2023

Hello,

Thanks again for the amazing tool and the amazing documentation.

I have been preparing the samplesheet and I believe I am getting the "sampleset" text string wrong.

I have my population in plink.bed, plink.fam, and plink.bim.

I understand the "bfile_path" is:

/Users/myname/Desktop/PLINK/plink_mac/plink_genome_test1

Could the "sampleset" then be "plink_genome_test1"? If not, what should the sampleset look like?

Thanks a lot for your time in advance and apologies for the inconvenience

Best

Member

smlmbrt commented Jan 23, 2023

An example of the sample sheet is here: https://github.com/PGScatalog/pgsc_calc/blob/main/assets/examples/samplesheet.csv

If your .bed/.fam/.bim files all start with plink_genome_test1 and contain all chromosomes your sample sheet would be:

sampleset,vcf_path,bfile_path,pfile_path,chrom
plink_genome_test1,,/Users/myname/Desktop/PLINK/plink_mac/plink_genome_test1,,

If it is split across chromosomes the files should have slightly longer root names but the same sampleset ID. An example for chrs 1 and 2:

sampleset,vcf_path,bfile_path,pfile_path,chrom
plink_genome_test1,,/Users/myname/Desktop/PLINK/plink_mac/plink_genome_test1_chr1,,1
plink_genome_test1,,/Users/myname/Desktop/PLINK/plink_mac/plink_genome_test1_chr2,,2
[...]
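
The two layouts above can also be generated rather than typed by hand. The following is a minimal Python sketch, not part of pgsc_calc: the function name `bfile_samplesheet` is hypothetical, and the per-chromosome naming `<prefix>_chr<N>` is simply assumed from the example paths above.

```python
import csv
import io

# Column order taken from the example samplesheet quoted above.
HEADER = ["sampleset", "vcf_path", "bfile_path", "pfile_path", "chrom"]


def bfile_samplesheet(sampleset, bfile_prefix, chroms=None):
    """Return samplesheet CSV text for PLINK .bed/.bim/.fam filesets.

    chroms=None: one row with an empty chrom column (all chromosomes
    in a single fileset).
    chroms=[...]: one row per chromosome, assuming files are named
    <bfile_prefix>_chr<N> (an assumption based on the example above).
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    if chroms is None:
        writer.writerow([sampleset, "", bfile_prefix, "", ""])
    else:
        for chrom in chroms:
            # Same sampleset ID on every row; only path and chrom change.
            writer.writerow([sampleset, "", f"{bfile_prefix}_chr{chrom}", "", chrom])
    return out.getvalue()


if __name__ == "__main__":
    prefix = "/Users/myname/Desktop/PLINK/plink_mac/plink_genome_test1"
    print(bfile_samplesheet("plink_genome_test1", prefix, chroms=[1, 2]))
```

The bfile_path column holds the file prefix without the .bed/.bim/.fam extension, as in the examples above.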

smlmbrt added the user-query label Jan 23, 2023
Author

olatzu commented Jan 23, 2023

Dear @smlmbrt ,

Thanks a lot for the quick response! It works now. The problem now seems to be that "the Score PGS001927_hmPOS_GRCh37 fails minimum matching threshold (22.01% variants match)". Is there any way to overcome that? (I already tried switching to GRCh38, which matches even fewer variants.) Thanks a lot again!

Member

smlmbrt commented Jan 23, 2023

You can adjust the --min_overlap flag to score on the available 22% of variants in the score; however, it's probably best to investigate why the data is missing so many variants. Some options:
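
A rough first check on the missing variants is to see how many scoring-file positions appear in the target .bim file at all. The Python sketch below is an illustration only and is not the pipeline's matching logic: it compares chromosome and base-pair position, ignores alleles, and the function names are hypothetical. It assumes the standard six-column .bim layout (chromosome, variant ID, cM position, base-pair coordinate, allele 1, allele 2).

```python
def bim_positions(bim_lines):
    """Collect (chrom, pos) pairs from the lines of a PLINK .bim file."""
    positions = set()
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Chromosome codes are compared as plain strings, so "chr1" and "1"
        # would NOT match; differing conventions are one possible cause of
        # a low overlap.
        positions.add((fields[0], int(fields[3])))
    return positions


def position_overlap(score_positions, bim_lines):
    """Fraction of scoring-file (chrom, pos) pairs found in the .bim lines."""
    score = set(score_positions)
    if not score:
        return 0.0
    return len(score & bim_positions(bim_lines)) / len(score)


if __name__ == "__main__":
    # Made-up toy data, only to show the call shape.
    bim = ["1 rs1 0 100 A G", "1 rs2 0 200 C T"]
    print(position_overlap([("1", 100), ("1", 300)], bim))
```

If the positional overlap is also low, a genome build mismatch or sparse genotyping is more likely than an allele or strand problem.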

Author

olatzu commented Jan 23, 2023

Dear @smlmbrt ,

Thank you again very much for the info and the prompt reply! It now seems to have worked. However, I get the error:

Error: --score variant ID '1:89479074:C:T' appears multiple times in main
dataset.

Oddly enough, I have checked the .bim files and the PGS, and the variant is not there(?)

Thank you very much again!

Member

smlmbrt commented Jan 23, 2023

This is because the pipeline relabels the variants for consistency and scoring file formatting. If you do a grep/lookup by position in the .bim file you'll likely see multiple rows with those alleles. Implementing the response in this issue may fix this problem: #74 (comment)
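
The position lookup described above can be sketched in Python as follows. This is an illustration under assumptions, not pipeline code: the function names are hypothetical, the .bim file is assumed to have the standard six whitespace-separated columns (chromosome, variant ID, cM position, base-pair coordinate, allele 1, allele 2), and rows are treated as duplicates when they share chromosome, position, and the same unordered pair of alleles.

```python
from collections import defaultdict


def rows_at_position(bim_lines, chrom, pos):
    """Return the .bim rows (as field lists) located at chrom:pos."""
    hits = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 6 and fields[0] == str(chrom) and int(fields[3]) == int(pos):
            hits.append(fields)
    return hits


def duplicated_sites(bim_lines):
    """Map (chrom, pos, alleles) to variant IDs for sites that occur more than once."""
    groups = defaultdict(list)
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # frozenset makes the allele pair order-independent (C/T equals T/C).
        key = (fields[0], int(fields[3]), frozenset(fields[4:6]))
        groups[key].append(fields[1])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # Made-up rows; in practice pass an open .bim file handle instead.
    bim = ["1 rs1 0 89479074 T C", "1 rs1dup 0 89479074 C T", "1 rs2 0 100 A G"]
    print(rows_at_position(bim, "1", 89479074))
    print(duplicated_sites(bim))
```

Both functions accept any iterable of lines, so an open file handle on the .bim file can be passed directly. Rows that have different original IDs but share position and alleles would collapse onto one relabelled ID such as 1:89479074:C:T.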
