Home

When to use this tool

This tool corrects for attenuation bias in regression specifications that control for a polygenic index (PGI). Use it if you have constructed or downloaded a PGI and are regressing a phenotype (which may or may not be the phenotype the PGI was built for) on the PGI, other covariates, and, optionally, interactions between the PGI and covariates. For a suitable use case, see Papageorge and Thom (2018). In Becker et al. (2021) we apply this tool to the Papageorge and Thom data.
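The attenuation bias mentioned above can be illustrated with a small simulation. This is only a sketch of the general errors-in-variables phenomenon, not of this tool's internals: the "true" genetic factor `g`, the noise level, and the helper names `ols_slope` and `standardize` are all assumptions made for illustration.

```python
import math
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with a constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def standardize(x):
    """Rescale x to mean 0 and standard deviation 1."""
    n = len(x)
    mean_x = sum(x) / n
    sd = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    return [(a - mean_x) / sd for a in x]


rng = random.Random(0)
n = 50_000

# Hypothetical "true" genetic factor and an outcome with slope 1 on it.
g = [rng.gauss(0, 1) for _ in range(n)]
y = [gi + rng.gauss(0, 1) for gi in g]

# Simulated PGI: the true factor plus independent noise of equal variance,
# so the proxy's reliability (squared correlation with g) is about 0.5.
pgi = standardize([gi + rng.gauss(0, 1) for gi in g])

slope_true = ols_slope(standardize(g), y)   # close to 1
slope_noisy = ols_slope(pgi, y)             # attenuated, close to sqrt(0.5)

print(f"slope on true factor: {slope_true:.3f}")
print(f"slope on noisy PGI:   {slope_noisy:.3f}")
```

In this simulated setting the coefficient on the noisy PGI is shrunk toward zero by roughly the square root of the proxy's reliability, which is the kind of bias the tool is meant to address.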

Required and optional flags

A good way to learn how to use this tool is to read the descriptions of its required and optional input arguments. To do so, first cd into the directory containing the repository. Then type

python3 pgic.py -h

In your terminal, you should now see something like

usage: pgic.py [-h] --reg-data-file FILE_PATH --outcome COLUMN_NAME --pgi-var
               COLUMN_NAME [--pgi-pheno-var COLUMN_NAME]
               [--pgi-interact-vars [COLUMN_NAME [COLUMN_NAME ...]]]
               --covariates COLUMN_NAME [COLUMN_NAME ...]
               [--weights [COLUMN_NAME]] [--out FILE_PREFIX]
               [--output-vars [COLUMN_NAME [COLUMN_NAME ...]]] [--h2 PARAM]
               [--R2 PARAM] [--grm-cutoff PARAM]
               [--gcta-exec FILE_PATH | --bolt-exec FILE_PATH]
               [--download-gcta | --download-bolt] [--grm FILE_PREFIX]
               [--bfile FILE_PREFIX] [--pheno-file FILE_PATH]
               [--pheno-file-pheno-col COLUMN_NAME] [--jk-se]
               [--num-blocks NUM_BLOCKS]
               [--id-col [COLUMN_NAME [COLUMN_NAME ...]]] [--force]
               [--logging-level LOGGING_LEVEL] [--num-threads NUM_THREADS]

Any unbracketed flag is required for every specification. Bracketed flags are optional, but certain combinations of arguments make some of them necessary, as explained below. You can also scroll down in your terminal to read a detailed description of each flag and how to invoke it.

Minimum functionality

The simplest call, in terms of functionality, will resemble the following:

python3 pgic.py --reg-data-file test/data/reg_data.txt \
                --outcome "PHENO" \
                --pgi-var "PGI" \
                --covariates "PC*" \
                --pgi-interact-vars "PC1" "PC2" "PC3" \
                --R2 0.1 \
                --h2 0.2 \
                --out correction

This call represents minimum functionality because no parameters need to be estimated prior to applying the linear transformation. Description of flags:

--reg-data-file: Path to the table containing the data required for the regression. Any delimiter is acceptable for this table.
--outcome: The column in --reg-data-file corresponding to the outcome, or dependent variable, in the regression.
--pgi-var: The column in --reg-data-file corresponding to the PGI.
--covariates: Control variables in --reg-data-file to be included in the specification. Note that you can invoke wildcarding (either "*" for one or more characters or "?" for one character) to simplify this invocation.
--pgi-interact-vars: Interaction(s) between the PGI and covariate(s). You do not need to generate these interaction terms in preprocessing. We generate them for you by multiplying --pgi-var with each of the variables passed to this flag.
--R2: R² from a regression of the phenotype (corresponding to the PGI) on the PGI and a constant. We can estimate this for you, but you can also specify it.
--h2: Estimated heritability. This may come from the literature, an estimate from GCTA, or another prior.
--out: Output prefix for the results and log files. Do not include a file suffix in this argument.
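The wildcard matching and interaction terms described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the helper names `expand_columns` and `add_interactions` and the `PGI_x_<var>` column naming are hypothetical, and Python's `fnmatch` lets "*" match any run of characters, including an empty one, which may differ slightly from the tool's own matching.

```python
from fnmatch import fnmatchcase


def expand_columns(patterns, columns):
    """Return the columns matching any pattern, in table order."""
    return [c for c in columns if any(fnmatchcase(c, p) for p in patterns)]


def add_interactions(rows, pgi_var, interact_vars):
    """Add one PGI-by-covariate product column per interaction variable."""
    augmented = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        for var in interact_vars:
            # Hypothetical column name; the tool may label these differently.
            new_row[f"{pgi_var}_x_{var}"] = row[pgi_var] * row[var]
        augmented.append(new_row)
    return augmented


# Example table header and a single example row (made-up values).
columns = ["PHENO", "PGI", "PC1", "PC2", "PC3", "AGE"]
covariates = expand_columns(["PC*"], columns)

rows = [{"PGI": 0.5, "PC1": 2.0, "PC2": -1.0, "PC3": 4.0}]
augmented = add_interactions(rows, "PGI", ["PC1", "PC2", "PC3"])

print(covariates)
print(augmented[0])
```

Here the pattern "PC*" selects the three principal-component columns but not "PGI" or "PHENO", and each interaction column is simply the row-wise product of the PGI and the named covariate.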

Heritability estimation

A key parameter for the correction is --h2, the heritability of the phenotype corresponding to the PGI. If you do not want to specify this value, the software will estimate it with GCTA. To request this, simply omit --h2 from the call. When --h2 is omitted, the following arguments are relevant (optional arguments are bracketed):

--bfile: Genotype data in plink format (.bed/.bim/.fam). Do not include suffixes in the call.
--pheno-file: GCTA .phen file. This must be a whitespace-delimited file, with the first column set to FID, the second column set to IID, and the third column set to the phenotype value (1/0 for case/control, -9 for missing). The table cannot have a header.
[--grm]: Genomic relatedness matrix, constructed from the specified --bfile data. It is recommended to specify this if you already have it, since its construction can be very slow. Do not include the .grm.bin, .grm.id, or .grm.N.bin suffixes.
[--gcta-exec]: Path to the gcta64 executable. If not specified, we will download it for you. This executable must be version 1.93.0beta, released December 9, 2019.
[--grm-cutoff]: The heritability estimate will be most accurate if related individuals are excluded. This flag, which comes from GCTA, removes individuals whose relatedness exceeds the specified threshold while retaining as many individuals as possible.
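Before passing a file to --pheno-file, it can help to check it against the layout described above (three whitespace-delimited columns, a numeric phenotype, no header). The following is a minimal sketch of such a check; the function name `check_phen_lines` is hypothetical, and the check is not part of the tool or of GCTA.

```python
def check_phen_lines(lines):
    """Return (line_number, message) pairs for lines that break the layout."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # splits on any run of whitespace
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 3:
            problems.append((number, f"expected 3 columns, found {len(fields)}"))
            continue
        try:
            float(fields[2])  # the phenotype must parse as a number, e.g. 1, 0, -9
        except ValueError:
            problems.append((number, f"non-numeric phenotype value {fields[2]!r}"))
    return problems


# Made-up example content: one clean file and one with a header and an extra column.
good = ["fam1 id1 1", "fam1 id2 0", "fam2 id3 -9"]
bad = ["FID IID PHENO", "fam1 id1 1 extra"]

print(check_phen_lines(good))
print(check_phen_lines(bad))
```

A header row shows up as a non-numeric phenotype on line 1, which is a convenient way to catch the "no header" requirement. To check a real file, the lines can be read with `open(path).read().splitlines()`.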

Jackknife standard errors

The correction uses the point estimates of --h2 and --R2, but you may also want to see the standard errors of one or both of these parameters. To obtain them, pass --jk-se in the call. If this option is specified, other flags will be required or encouraged:

--id-col: The person-level ID in --reg-data-file. This is used to assign people to jackknife iterations.
[--num-blocks]: Number of jackknife iterations. Use this to specify a value other than the default. More iterations will produce more reliable estimates.
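For readers unfamiliar with the technique, a generic delete-one-block jackknife can be sketched as below. This is a textbook version under stated assumptions, not the tool's implementation: the round-robin block assignment, the helper names, and the use of a simple mean as the statistic are all illustrative choices, whereas the tool applies the jackknife to its own estimates.

```python
import math


def assign_blocks(ids, num_blocks):
    """Map each unique ID to a block index in round-robin order."""
    unique_ids = sorted(set(ids))
    return {person: i % num_blocks for i, person in enumerate(unique_ids)}


def jackknife_se(ids, values, num_blocks, statistic):
    """Delete-one-block jackknife standard error of statistic(values)."""
    blocks = assign_blocks(ids, num_blocks)
    estimates = []
    for b in range(num_blocks):
        # Recompute the statistic with every observation in block b left out.
        kept = [v for person, v in zip(ids, values) if blocks[person] != b]
        estimates.append(statistic(kept))
    mean_est = sum(estimates) / num_blocks
    variance = (num_blocks - 1) / num_blocks * sum(
        (e - mean_est) ** 2 for e in estimates
    )
    return math.sqrt(variance)


def mean(xs):
    return sum(xs) / len(xs)


# Made-up example: four people, one value each, one person per block.
ids = ["a", "b", "c", "d"]
values = [1.0, 2.0, 3.0, 4.0]
se = jackknife_se(ids, values, num_blocks=4, statistic=mean)
print(f"jackknife SE of the mean: {se:.4f}")
```

With one person per block and the mean as the statistic, this reduces to the usual standard error of the mean, which makes it easy to check the sketch by hand.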
