KnockoffGWAS

A powerful and versatile statistical method for the analysis of genome-wide association data with population structure. This method localizes causal variants while controlling the false discovery rate, and is valid even if the samples have diverse ancestries and familial relatedness.

Accompanying paper:

False discovery rate control in genome-wide association studies with population structure
M. Sesia, S. Bates, E. Candès, J. Marchini, C. Sabatti
Proceedings of the National Academy of Sciences (2021) https://www.pnas.org/content/118/40/e2105841118

For more information, visit: https://msesia.github.io/knockoffgwas.

For an earlier version of this method restricted to homogeneous populations, see also KnockoffZoom.

Overview

The goal of KnockoffGWAS is to identify causal variants for complex traits effectively and precisely through genome-wide fine-mapping, accounting for linkage disequilibrium and controlling the false discovery rate. The results leverage the genetic models used for phasing and are equally valid for quantitative and binary traits. The main innovation of KnockoffGWAS is its support for the analysis of diverse populations, with different ancestries and possibly close familial relatedness. Furthermore, KnockoffGWAS includes a highly efficient standalone C++ program for generating genetic knockoffs for large data sets, which makes it easier to apply than KnockoffZoom.

The code contained in this repository is designed to allow the application of KnockoffGWAS to large datasets, such as the UK Biobank. Some of the code is provided in the form of Bash and R scripts, while the core algorithms for Monte Carlo knockoff sampling are implemented in C++.

The KnockoffGWAS methodology is divided into different modules, each corresponding to a separate Bash script contained in the directory knockoffgwas/.

Dependencies

Recommended OS: Linux. macOS is not officially supported, but the software should be compatible with it.

The following software should be available from your user path:

PLINK 1.9

The following R (version 4.0.2) packages are required:

fastcluster 1.1.25
bigsnpr 1.4.4
bigstatsr 1.2.3
tidyverse 1.3.0
latex2exp 0.5.0
gridExtra 2.3

The above version numbers correspond to the configuration on which this software was tested. Newer versions are likely to be compatible, but have not been tested.
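
As a quick sanity check, the dependencies listed above can be probed from the command line. The following is a minimal sketch, not part of the repository: it assumes the PLINK 1.9 binary is installed under the name `plink` (some systems use a different name) and that R is invoked through `Rscript`. The helper `check_cmd` is a hypothetical name introduced here for illustration. The sketch only reports what is missing and does not check version numbers.

```shell
# Report whether a command is available on the user path.
# check_cmd is a hypothetical helper, not part of KnockoffGWAS.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

check_cmd plink     # assumption: the PLINK 1.9 binary is named "plink"
check_cmd Rscript

# If R is available, list the required packages that are not installed.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e '
    pkgs <- c("fastcluster", "bigsnpr", "bigstatsr",
              "tidyverse", "latex2exp", "gridExtra")
    ok <- sapply(pkgs, requireNamespace, quietly = TRUE)
    absent <- pkgs[!ok]
    if (length(absent) > 0) {
      cat("missing R packages:", absent, "\n")
    } else {
      cat("all required R packages found\n")
    }
  '
fi
```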

Installation

Clone this repository on your system and install any missing dependencies. Estimated installation time (dependencies): 5-15 minutes. Compile the C++ program for knockoff generation by entering the directory snpknock2 and running make.
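
The compilation step can be sketched as a small shell function. This is an illustration under stated assumptions rather than an official install script: `build_knockoff_generator` is a hypothetical name, and the function expects the path to an existing clone of this repository as its only argument. It relies on `make -C`, which runs `make` inside the given directory.

```shell
# Compile the knockoff generator inside an existing clone of the repository.
# build_knockoff_generator is a hypothetical helper, not part of KnockoffGWAS.
build_knockoff_generator() {
  repo_dir="$1"
  if [ ! -d "$repo_dir/snpknock2" ]; then
    echo "snpknock2 directory not found in: $repo_dir" >&2
    return 1
  fi
  # Equivalent to: cd "$repo_dir/snpknock2" && make
  make -C "$repo_dir/snpknock2"
}
```

From the root of the cloned repository, the function would be called as `build_knockoff_generator .`.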

Toy dataset and tutorial

A toy dataset containing 1000 artificial samples typed at 2000 loci (divided between chromosomes 21 and 22) is offered as an example to test KnockoffGWAS. To run the example, simply execute the script analyze.sh.

./analyze.sh

This script will also verify whether the required R packages are available and install any that are missing.

The analysis should take less than 5 minutes on a personal computer. The results can be visualized interactively with the script visualize.sh, which will launch a Shiny app in your browser. Some additional R packages are required by the visualization tool, and will be automatically installed if not found.

./visualize.sh

The expected results for the analysis of this toy dataset are provided in the directory results/ and can be visualized by running the script visualize.sh before running analyze.sh. Note that the script analyze.sh will overwrite the default results.
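
Because analyze.sh overwrites the contents of results/, it can be useful to keep a copy of the expected results for later comparison. The following is a minimal sketch of one way to do this. The helper `backup_results` and the backup directory name `results_expected` are hypothetical and not part of the repository.

```shell
# Copy a results directory to a backup location before it is overwritten.
# backup_results is a hypothetical helper, not part of KnockoffGWAS.
backup_results() {
  src="$1"
  dest="$2"
  if [ ! -d "$src" ]; then
    echo "nothing to back up: $src not found" >&2
    return 1
  fi
  # Refuse to touch an existing backup so it is never silently modified.
  if [ -e "$dest" ]; then
    echo "backup already exists: $dest" >&2
    return 1
  fi
  cp -r "$src" "$dest"
}

# Run from the repository root, before ./analyze.sh.
# "results_expected" is a hypothetical directory name.
if [ -d results ]; then
  backup_results results results_expected || echo "backup skipped"
fi
```

After running ./analyze.sh, the new output in results/ can then be compared against the saved copy, for example with `diff -r results results_expected`.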

See https://msesia.github.io/knockoffgwas/tutorial.html for a more detailed tutorial.

Large-scale applications

KnockoffGWAS is computationally efficient and we have successfully applied it to the analysis of the genetic data in the UK Biobank. For more information, visit https://msesia.github.io/knockoffgwas/ukbiobank.html. The analysis of large datasets cannot be carried out on a personal computer. The computational resources required for the analysis of the UK Biobank data are summarized in the accompanying paper.

The modular nature of our method allows the code contained in each of the 4 main scripts to be easily deployed on a computing cluster for large-scale applications. This task will require some additional user effort compared to the toy example, but the scripts for each module are documented and quite intuitive.

Authors

Matteo Sesia (University of Southern California).

Contributors

Stephen Bates

License

This software is distributed under the GPLv3 license.
