simplest vs bfile_path SAMPLESHEET #77

olatzu · 2023-01-20T17:50:05Z

Hello,

Thanks again for the amazing tool and the amazing documentation.

I have been preparing the samplesheet and I believe I am getting the "sampleset" text string wrong.

I have my population in plink.bed plink.fam plink.bim

I understand the "bfile_path" is :

/Users/myname/Desktop/PLINK/plink_mac/plink_genome_test1

Could the "sampleset" then be "plink_genome_test1"? If not, what should the sampleset look like?

Thanks a lot for your time in advance and apologies for the inconvenience

Best

smlmbrt · 2023-01-23T14:06:29Z

An example of the sample sheet is here: https://github.com/PGScatalog/pgsc_calc/blob/main/assets/examples/samplesheet.csv

If your .bed/.fam/.bim files all start with plink_genome_test1 and contain all chromosomes your sample sheet would be:

sampleset,vcf_path,bfile_path,pfile_path,chrom
plink_genome_test1,,/Users/myname/Desktop/PLINK/plink_mac/plink_genome_test1,,

If it is split across chromosomes the files should have slightly longer root names but the same sampleset ID. An example for chrs 1 and 2:

sampleset,vcf_path,bfile_path,pfile_path,chrom
plink_genome_test1,,/Users/myname/Desktop/PLINK/plink_mac/plink_genome_test1_chr1,,1
plink_genome_test1,,/Users/myname/Desktop/PLINK/plink_mac/plink_genome_test1_chr2,,2
[...]
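For anyone generating the sheet programmatically, here is a minimal sketch in Python that writes the same five columns shown above. The function name `bfile_samplesheet` and the idea of passing a chromosome-to-prefix mapping are illustrative assumptions, not part of pgsc_calc; only the column names and row layout come from the examples in this thread.

```python
import csv
import io

# Column order taken from the example sample sheets in this thread.
FIELDS = ["sampleset", "vcf_path", "bfile_path", "pfile_path", "chrom"]


def bfile_samplesheet(sampleset, bfile_prefixes):
    """Return sample sheet text for PLINK 1 bfiles (hypothetical helper).

    bfile_prefixes maps a chromosome label to the file prefix without the
    .bed/.bim/.fam extension. Use "" as the label when one fileset holds
    all chromosomes.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for chrom, prefix in bfile_prefixes.items():
        writer.writerow(
            {
                "sampleset": sampleset,  # same ID on every row
                "vcf_path": "",
                "bfile_path": prefix,
                "pfile_path": "",
                "chrom": chrom,
            }
        )
    return buf.getvalue()


if __name__ == "__main__":
    root = "/Users/myname/Desktop/PLINK/plink_mac/plink_genome_test1"
    # Split-by-chromosome case: one row per file, shared sampleset ID.
    print(bfile_samplesheet("plink_genome_test1",
                            {"1": root + "_chr1", "2": root + "_chr2"}))
```

Redirecting the printed text to a file such as `samplesheet.csv` gives a sheet in the same shape as the examples above.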

olatzu · 2023-01-23T15:24:17Z

Dear @smlmbrt ,

Thanks a lot for the quick response! It worked now. The problem now seems to be that "the Score PGS001927_hmPOS_GRCh37 fails minimum matching threshold (22.01% variants match)". Is there any way to overcome that? (I already tried switching to GRCh38, where the match rate is even lower.) Thanks a lot again!

smlmbrt · 2023-01-23T16:44:28Z

You can adjust the --min_overlap flag to score on the available 22% of variants in the score; however, it's probably best to investigate why the data is missing so many variants. Some options:

You're only using directly measured genotypes from an array and so the variant overlap is low. This can be solved by running imputation to a common reference panel (e.g. https://imputationserver.sph.umich.edu/) and then supplying the output to the calculator.
You're using a VCF that doesn't have many samples or proper reference allele encoding. You can see a discussion of that scenario here: format for input data includes reference calls? #50
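To get a rough feel for how low the overlap is before rerunning anything, a quick position-only check against the .bim file can help. The sketch below is an assumption-laden approximation, not the pipeline's matching logic: it ignores alleles and strand, so it will usually give a more optimistic figure than the percentage the pipeline reports. The function names are hypothetical, and chromosome labels must use the same style in both inputs (e.g. "1" rather than "chr1").

```python
def read_bim_positions(bim_lines):
    """Collect (chromosome, position) pairs from PLINK .bim lines.

    A .bim row has six whitespace-separated fields:
    chromosome, variant ID, cM position, bp position, allele 1, allele 2.
    """
    positions = set()
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        chrom, pos = fields[0], int(fields[3])
        positions.add((chrom, pos))
    return positions


def overlap_percent(score_positions, target_positions):
    """Percentage of scoring-file positions that are present in the target."""
    score_positions = set(score_positions)
    if not score_positions:
        return 0.0
    matched = score_positions & target_positions
    return 100.0 * len(matched) / len(score_positions)


if __name__ == "__main__":
    # Toy data for illustration only.
    bim = [
        "1\trs1\t0\t1000\tA\tG",
        "1\trs2\t0\t2000\tC\tT",
    ]
    score = [("1", 1000), ("1", 3000), ("2", 500), ("2", 900)]
    print(overlap_percent(score, read_bim_positions(bim)))
```

If this position-only figure is also low, missing variants (for example, unimputed array data) are a more likely cause than allele encoding.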

olatzu · 2023-01-23T17:10:22Z

Dear @smlmbrt ,

Thank you again very much for the info and the prompt reply! It now seems to have worked. However, I get the error:

Error: --score variant ID '1:89479074:C:T' appears multiple times in main
dataset.

Weirdly enough, I have checked the .bim files and the PGS, and the variant is not there(?)

Thank you very much again!

smlmbrt · 2023-01-23T17:14:49Z

This is because the pipeline relabels the variants for consistency and scoring file formatting. If you do a grep/lookup by position in the .bim file you'll likely see multiple rows with those alleles. Implementing the response in this issue may fix this problem: #74 (comment)
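The position lookup suggested above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and grouping rows by chromosome, position, and the unordered pair of alleles is an assumption about what would collide after relabelling, not a description of the pipeline's exact ID scheme.

```python
from collections import defaultdict


def find_duplicate_variants(bim_lines):
    """Group .bim rows that share chromosome, position, and allele pair.

    Returns a dict mapping (chrom, pos, alleles) to the list of original
    variant IDs, keeping only groups with more than one row.
    """
    groups = defaultdict(list)
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        chrom, var_id, _cm, pos, a1, a2 = fields[:6]
        # Alleles are compared as an unordered pair, so "C T" and "T C"
        # at the same position land in the same group.
        key = (chrom, int(pos), frozenset((a1.upper(), a2.upper())))
        groups[key].append(var_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # Toy rows for illustration; the IDs are made up.
    rows = [
        "1\tvariant_a\t0\t89479074\tT\tC",
        "1\tvariant_b\t0\t89479074\tC\tT",
        "1\tvariant_c\t0\t89479100\tA\tG",
    ]
    for (chrom, pos, alleles), ids in find_duplicate_variants(rows).items():
        print(chrom, pos, sorted(alleles), ids)
```

On real data, pass an open file handle instead of the toy list, e.g. `find_duplicate_variants(open("plink_genome_test1.bim"))`. Rows that differ only in their original ID would show up here even though searching for the relabelled ID string finds nothing.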

smlmbrt added the user-query User queries & requests label Jan 23, 2023

nebfield closed this as completed Jul 13, 2023
