MichelNivard/awesome-complex-trait-genetics

awesome-complex-trait-genetics

A list of awesome tools for human complex trait genetics, biased/slanted towards post-GWAS analysis.

How can you add your tool? Via a pull request. If you don't know what that is, read the contributing guidelines: (https://github.com/MichelNivard/awesome-genetics/blob/main/contributing.md)

This list may form the basis for a static review, in which case serious contributors, i.e. those who contribute to the list and to the preamble paragraphs for each section, will be invited as collaborators.

Genetic architecture

Sumstats based:

  • LDSC ldsc is a python command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores. A Python3 port is available.

  • LDSR LDSC implemented in R. Several quality-of-life features simplify the user interface to LDSC and make it dirt simple to run heritability estimates, genetic correlations and partitioned heritability from in-memory summary statistics.

  • GCTB GCTB is a software tool that comprises a family of Bayesian linear mixed models for complex trait analyses using genome-wide SNPs. It was developed to simultaneously estimate the joint effects of all SNPs and the genetic architecture parameters for a complex trait. There are now extensions to estimate the same Bayesian linear model parameters based on summary data.
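
As a rough illustration of what the summary-statistic tools above estimate: under the basic LD score regression model, the expected GWAS chi-square statistic of SNP j is N·h²·ℓ_j/M plus an intercept, so regressing chi-square statistics on LD scores recovers heritability from the slope. The sketch below is a minimal, unweighted version run on simulated data. The function name `ldsc_h2` and all simulation settings are illustrative assumptions, and it omits the regression weights and block jackknife standard errors that ldsc itself uses.

```python
import math
import random

def ldsc_h2(chi2, ld_scores, n_samples):
    """Unweighted LD score regression sketch: returns (h2 estimate, intercept)."""
    m = len(chi2)
    mean_l = sum(ld_scores) / m
    mean_c = sum(chi2) / m
    # Ordinary least squares of chi-square statistics on LD scores
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    # Model: E[chi2_j] = (N * h2 / M) * l_j + intercept, so h2 = slope * M / N
    return slope * m / n_samples, intercept

# Simulated inputs (assumed values, not real data)
rng = random.Random(0)
M, N, TRUE_H2 = 20000, 50000, 0.4
ld_scores = [rng.uniform(1, 200) for _ in range(M)]
# z_j ~ Normal(0, 1 + N * h2 * l_j / M), so chi2_j = z_j^2 has the model's expectation
chi2 = [rng.gauss(0, math.sqrt(1 + N * TRUE_H2 * l / M)) ** 2 for l in ld_scores]

h2_hat, intercept = ldsc_h2(chi2, ld_scores, N)
print(f"h2 estimate: {h2_hat:.3f}, intercept: {intercept:.2f}")
```

For real analyses, use the tools listed above; the sketch only shows why the regression slope carries the heritability signal.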

Raw data based:

  • GCTA GCTA (Genome-wide Complex Trait Analysis) is a software package initially developed to estimate the proportion of phenotypic variance explained by all genome-wide SNPs for a complex trait.

  • RHE-mc/GENIE RHE-mc is a method to estimate the proportion of phenotypic variance explained by SNPs, and GENIE extends this to model GxE effects

  • BOLT-LMM/BOLT-REML The BOLT-LMM software package currently consists of two main algorithms, the BOLT-LMM algorithm for mixed model association testing, and the BOLT-REML algorithm for variance components analysis (i.e., partitioning of SNP-heritability and estimation of genetic correlations).

  • LDAK/SumHer/PCGC LDAK is a software package for analysing association study data. PCGC (phenotype-correlation genotype-correlation) Regression is an alternative to REML when estimating heritability for binary traits (i.e., diseases).

Univariate models (heritability/polygenicity/stratified/geneset enrichment etc)

  • i-LDSC interaction-LD score (i-LDSC) regression: Model an additional score that measures the amount of non-additive genetic variation that is tagged by each variant in the data.

  • ACLR Autocorrelation LD regression: a tool to efficiently estimate the autocorrelation of latent effects in large genetic data sets. (WARNING Python 2.7 code)

  • HAMSTA HAMSTA is a python package that estimates heritability explained by local ancestry using summary statistics from admixture mapping studies.

  • MAGMA MAGMA: Generalized gene-set analysis of GWAS data.

  • MiXeR Causal mixture model (MiXeR) is a tool designed to estimate the polygenic architecture of a single trait, including the total number of causal variants and the distribution of their effect sizes.

Genetic correlation (LD score derivatives/extensions)

  • HDL High-Definition Likelihood (HDL) is a likelihood-based method for estimating genetic correlation using GWAS summary statistics. Compared to LD Score regression (LDSC), it reduces the variance of a genetic correlation estimate by about 60%.

Stratified/local genetic correlations

  • LAVA LAVA (Local Analysis of [co]Variant Association) is a tool developed for local genetic correlation (rg) analysis.

Ancestry-aware genetic correlations:

  • s-ldxr S-LDXR is a method to stratify squared trans-ethnic genetic correlation by genomic annotations from GWAS summary statistics.

  • Popcorn Popcorn is a program for estimating the correlation of causal variant effect sizes across populations in GWAS.

  • mama mama is a Python-based command line tool that meta-analyzes GWAS summary statistics generated from distinct ancestry groups.

Model trait relationships beyond correlation

Genetic SEM/Factor models

Sumstats based:

  • GenomicSEM R package for Genomic Structural Equation Models. Fits user-defined models of the relations between complex traits based on GWAS summary data.

  • GUIDE Genetic Unmixing by Independent Decomposition (GUIDE), uses ICA to estimate statistically independent latent factors that best express the patterns of association across many traits.

  • FactorGO FactorGo is a scalable variational factor analysis model that learns pleiotropic factors using GWAS summary statistics!

  • GNA GNA is an R package for performing network analysis of genetic overlap derived from GWAS summary statistics

  • partialLDSC is an R-package to estimate partial genetic correlations from GWAS summary statistics, and compare them to their unadjusted counterparts, to quantify the contribution of a given confounder in explaining genetic similarity between conditions.

These tools also give you the percentage of SNPs that are pleiotropic:

  • bivariate MiXeR Bivariate causal mixture model (MiXeR) is a tool designed to estimate the degree of genetic overlap and shared causal variants between two traits.

  • trivariate MiXeR Trivariate causal mixture model (MiXeR) is a tool designed to estimate the degree of genetic overlap and shared causal variants between three traits.

Raw data based:

  • Gsens: Genetically informed sensitivity analysis to estimate the role of genetic confounding in phenotypic associations and for causal inference from observational data.

Two sample Mendelian Randomisation

  • TwoSampleMR An R package for performing Mendelian randomization using GWAS summary data.
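
To make the idea behind two-sample MR concrete: each genetic variant gives a ratio estimate of the causal effect (variant-outcome effect divided by variant-exposure effect), and the standard fixed-effect inverse-variance weighted (IVW) estimator combines these ratios across variants. The sketch below is a minimal illustration, not TwoSampleMR's implementation; the function name `ivw_estimate` and the toy effect sizes are assumptions made up for the example.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate and standard error (first-order weights)."""
    # Equivalent to a weighted regression of outcome effects on exposure
    # effects through the origin, with weights 1 / se_outcome^2.
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator, math.sqrt(1.0 / denominator)

# Toy summary statistics for three variants (made-up numbers)
beta_x = [0.10, 0.20, 0.30]     # variant effects on the exposure
beta_y = [0.05, 0.10, 0.15]     # variant effects on the outcome
se_y = [0.01, 0.01, 0.01]       # standard errors of the outcome effects

estimate, std_error = ivw_estimate(beta_x, beta_y, se_y)
print(f"IVW causal estimate: {estimate:.3f} (SE {std_error:.3f})")
```

The estimator assumes valid, independent instruments; methods in the sections below relax some of those assumptions.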

MR/Genetic architecture hybrid models

  • lhcMR lhcMR is an R package that performs bi-directional causal estimation between a pair of traits, while accounting for the presence of a potential heritable confounder acting on the pair.

  • CAUSE an MR method, Causal Analysis Using Summary Effect Estimates (CAUSE), that accounts for correlated and uncorrelated horizontal pleiotropic effects.

  • LCV LCV is an LD score based method for inferring genetically causal relationships using GWAS data.

Mendelian randomization in cis

  • MR-link-2 cis Mendelian randomization that is more robust to violations of the pleiotropy assumption from a single region

Colocalisation/finemapping of causal variants

  • coloc R package to perform genetic colocalisation analysis, test whether two related phenotypes share common genetic causal variant(s) in a given region.

  • fastENLOC This repository contains the software implementation of FastENLOC, which enables integrative genetic association analysis of molecular QTL data and GWAS data.

  • SharePro This repository contains the software implementation of SharePro, which uses an effect group-level approach to integrate LD modeling and colocalization assessment to account for multiple causal variants in colocalization analysis.

  • FINEMAP FINEMAP: Efficient variable selection using summary data from genome-wide association studies.

  • polyfun Package contains PolyFun for functionally-informed fine-mapping, PolyLoc for polygenic localization of complex trait heritability.

  • OPERA: The OPERA (omics pleiotropic association) method tests combinatorial pleiotropic associations between multiple molecular phenotypes (eQTL, DNAm) and a complex trait using summary-level data.

  • SuSiEx: Cross-population finemapping using summary statistics and LD reference panels.

Gene-level analysis (TWAS)

  • FUSION FUSION is a suite of tools for performing transcriptome-wide and regulome-wide association studies (TWAS and RWAS).

  • FOCUS FOCUS (Fine-mapping Of CaUsal gene Sets) is software to fine-map transcriptome-wide association study statistics at genomic risk regions

  • cTWAS Causal-TWAS (cTWAS) jointly models genetically regulated expression and the direct effects of genetic variants on the phenotype

Simulation

  • GWASBrewer A flexible tool for simulating realistic GWAS summary statistics for one, or many, traits.

  • magenpy: A python-based library that provides utilities for simulating complex traits with various genetic architectures.

Genomic data wrangling

  • HAIL Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data.

  • bigsnpr R package for the analysis of massive SNP arrays, primarily designed for human genetics.

  • ukbrapR ukbrapR (phonetically: 'U-K-B-wrapper') is an R package for working in the UK Biobank Research Analysis Platform (RAP). The aim is to make it quicker, easier, and more reproducible.

  • MungeSumstats R package designed to facilitate the standardisation of GWAS summary statistics.

  • tidyGWAS R package that harmonises GWAS data and repairs/imputes missing columns while removing as few rows as possible.

  • gwasRtools R package to (1) identify loci and independent lead SNPs (using online or local reference panel) and (2) annotate variants with nearest gene from GENCODE database.

  • qgg & gact qgg provides an infrastructure for efficient processing of large-scale genetic and phenotypic data, while gact is designed for establishing and populating a comprehensive database focused on genomic associations with complex traits, and provides R implementations of popular follow-up analyses (LD score regression, MAGMA, VEGAS, PoPS, etc).

  • bcftools bcftools is a tool for creating, editing, and manipulating VCF and BCF files

  • GenomicRanges An R package for working with genomic intervals

  • bedtools Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. A very fast and easy way to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF

  • giggle GIGGLE indexes several BED/VCFs to perform efficient genomic interval searches across all files in the index

  • magenpy: A python-based library that provides utilities for interfacing with genotype data (BED format), computing large-scale LD matrices, harmonizing various GWAS data sources (genotypes, LD, sumstats, annotations), and simulating complex traits.
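
Several of the tools above (bedtools, GenomicRanges, giggle) revolve around one basic operation: finding where two sets of genomic intervals overlap. The sketch below shows that operation in its simplest form, assuming BED-style half-open intervals given as `(chrom, start, end)` tuples. The function names and example coordinates are hypothetical, and the nested loop is deliberately naive; the listed tools are built to do this efficiently at scale.

```python
def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def intersect(features_a, features_b):
    """Return the overlapping segment for every overlapping pair (naive O(n*m))."""
    segments = []
    for a in features_a:
        for b in features_b:
            if overlaps(a, b):
                segments.append((a[0], max(a[1], b[1]), min(a[2], b[2])))
    return segments

# Made-up example coordinates
genes = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
peaks = [("chr1", 150, 250), ("chr1", 600, 700), ("chr2", 50, 120)]

shared = intersect(genes, peaks)
print(shared)
```

Because the intervals are half-open, `("chr1", 500, 600)` and `("chr1", 600, 700)` touch but do not overlap, so only two segments are reported.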

Polygenic scores

  • PRSice PRSice (pronounced 'precise') is a Polygenic Risk Score software for calculating, applying, evaluating and plotting the results of polygenic risk scores (PRS) analyses.

  • LDpred2 LDpred2 is a dedicated PRS method, available as an R package, that uses a Bayesian approach to polygenic risk scoring.

  • GCTB GCTB is a software tool that comprises a family of Bayesian linear mixed models for complex trait analyses using genome-wide SNPs. It was developed to simultaneously estimate the joint effects of all SNPs and the genetic architecture parameters for a complex trait.

  • LDpred-funct LDpred-funct is an extension of LDpred that uses functional annotations of the SNPs to modify the prior.

  • DPR DPR is a Bayesian PRS method that uses a non-parametric Dirichlet process prior

  • VIPRS VIPRS is a fast Bayesian PRS method that uses Variational Inference techniques to efficiently approximate the posterior for the effect sizes. The python package provides utilities for performing inference as well as computing polygenic scores and common evaluation metrics used in PRS settings.
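
The methods above differ mainly in how they derive per-SNP weights; once the weights exist, the score itself is a weighted sum of effect-allele counts. The sketch below shows only that final scoring step. The function name `polygenic_score`, the rsIDs and the weights are hypothetical, and it assumes the dosages are already coded for the same effect allele as the weights.

```python
def polygenic_score(dosages, weights):
    """Sum of weight * effect-allele dosage over SNPs present in both inputs."""
    return sum(weights[snp] * dosage
               for snp, dosage in dosages.items()
               if snp in weights)

# Hypothetical per-SNP weights (e.g. posterior effect sizes from a PRS method)
weights = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}

# Hypothetical effect-allele dosages (0, 1 or 2) for one individual;
# rs4 has no weight and is ignored
person = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}

score = polygenic_score(person, weights)
print(f"polygenic score: {score:.2f}")
```

In practice the harder problems are allele harmonisation and choosing the weights, which is what the listed tools address.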

Pharmacogenomics

  • PharmCAT Pharmacogenomics clinical annotation tool, a caller for pharmacogene alleles and their corresponding metaboliser phenotypes. Originally intended for clinical applications and single samples, but can be scaled to datasets of an arbitrary size.

  • PGxPOP Population-scale pharmacogenetic allele and phenotype caller. Allele definitions are based on a 2020 release of the PharmVar database and might need updating.

  • PyPGx Pharmacogenomic allele and phenotype caller that can be used on various genomic datasets, including next-generation sequencing (NGS), single nucleotide polymorphism (SNP) arrays, and long-read sequencing projects. It can detect and account for structural variants if used on NGS data. It supports 87 pharmacogenes, but not all have genotype/phenotype mappings.

GWAS result repositories (preferably with an API)

  • ieugwasr The OpenGWAS database comprises over 50,000 curated, QC'd and harmonised complete GWAS summary datasets and can be queried using an API. See here for documentation on the API itself. This R package is a wrapper to make generic calls to the API.

  • GWAScatalog API As of 2024-11-20, the GWAS Catalog contains 7083 publications, 692444 top associations and 96947 full summary statistics. GWAS Catalog data is currently mapped to Genome Assembly GRCh38.p14 and dbSNP Build 156.

  • GWAS atlas This atlas is a database of publicly available GWAS summary statistics. Each GWAS can be browsed with the manhattan plot, risk loci, MAGMA (i.e. gene-based) results, SNP heritability and genetic correlations with other GWAS in the database.

  • S4 programs S4 programs to calculate PRSs and evaluate them.

Mendelian randomization result repositories

  • Multi-ancestry Proteome-Phenome MR atlas Multi-ancestry atlas of protein-phenotype associations in European, African, and East Asian ancestries using MR and colocalization.

  • Proteome-phenome atlas Atlas of protein-phenotype associations in the UK Biobank (2024).

  • Proteome MR Mendelian Randomization and sensitivity analyses results for 1,545 proteins on 8 complex diseases in European and African ancestries (2022).

  • Proteome MR atlas Mendelian Randomization and sensitivity analyses results for 989 proteins and 225 traits in Europeans (2020).

Online tools

  • gnomAD browser The gnomAD (genome aggregation database) browser is an online tool for querying the gnomAD and ExAC population data

  • Open Targets Platform The Open Targets Platform is a comprehensive tool that supports systematic identification and prioritisation of potential therapeutic drug targets.

  • genebass Genebass is a resource of exome-based association statistics, made available to the public. The dataset encompasses 4,529 phenotypes with gene-based and single-variant testing across 394,841 individuals with exome sequence data from the UK Biobank.

  • All by All The All by All browser maps known and novel associations between genotypes and phenotypes using data contributed by All of Us Research Program participants as of July 1, 2022. All by All encompasses about 3,400 phenotypes with gene-based and single-variant associations across nearly 250,000 whole genome sequences

  • Bravo Variant browser from the Trans-Omics for Precision Medicine (TOPMed) consortium

  • Regeneron Exome browser Variant browser from Regeneron from ~983K individuals
