AnnotateMe

AnnotateMe is an annotation framework for single nucleotide polymorphisms (SNPs). It can be used to map genetic variants to their likely affected genes and to perform gene-set overlap analysis on the associated genes.

The program takes as input any list of variants along with the input type. When run, AnnotateMe performs a functional annotation and enrichment analysis, and saves all intermediate files as well as plots in a folder called RESULTS/.

General information on usability

AnnotateMe is freely available online as a web server at http://snpxplorer.eu.ngrok.io. The tool can also be installed on your local machine: please follow the instructions at the bottom of the page to make sure you have all the packages and tools required to run AnnotateMe. Keep in mind that after cloning AnnotateMe onto your own system, additional annotation sources, including annotation files, must be downloaded. For this reason, we recommend contacting us (n.tesi@amsterdamumc.nl) for troubleshooting.

Input data

The format of the input variants for AnnotateMe is not strict: the user can input data in multiple formats (chromosome:position, rsid) and should specify the input data type.
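As an illustration of the two input formats mentioned above, the following Python sketch classifies variant identifiers as chromosome:position or rsid. It is not taken from AnnotateMe (which is written in R): the regular expressions, the optional "chr" prefix, and the function names `classify_variant` and `detect_input_type` are assumptions made for this example.

```python
import re

# Hypothetical patterns: the exact formats AnnotateMe accepts may differ.
CHR_POS = re.compile(r"^(?:chr)?(\d{1,2}|X|Y|MT?):(\d+)$", re.IGNORECASE)
RSID = re.compile(r"^rs\d+$", re.IGNORECASE)


def classify_variant(token):
    """Return (kind, parsed value) for a single variant identifier."""
    token = token.strip()
    match = CHR_POS.match(token)
    if match:
        # Normalize to (chromosome, integer position).
        return "chr:pos", (match.group(1).upper(), int(match.group(2)))
    if RSID.match(token):
        return "rsid", token.lower()
    return "unknown", token


def detect_input_type(tokens):
    """Return the single input type shared by all tokens, or raise ValueError."""
    kinds = {classify_variant(t)[0] for t in tokens}
    if len(kinds) != 1 or "unknown" in kinds:
        raise ValueError("input is empty, mixed, or contains unrecognized identifiers")
    return kinds.pop()
```

A check like this could be run before submitting a list, so that the declared input type matches what the file actually contains.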

Variant-gene mapping

The first step in AnnotateMe is variant-gene mapping, which associates each variant with one or more genes. This step is one of the most delicate and important: since most GWAS hits lie in non-coding regions, understanding the functional consequences of each variant is not trivial. We link each variant to its likely affected gene(s) by combining annotation from Combined Annotation Dependent Depletion (CADD, v1.3)[1], expression quantitative trait loci (eQTL) in blood from the GTEx consortium (v8)[2], and positional mapping up to 500 kb from the reported variants (RefSeq version 98)[3].

CADD annotation is used to inspect each variant's consequences: for coding variants (e.g. synonymous or missense variants), we confidently associate the variant with the corresponding gene. We take LD patterns into account in doing so, checking the consequences of all variants in high LD with the input variants. If no coding consequence is found, we first consider possible eQTL associations and, if these are also unavailable, we include all genes at increasing distance d from the variant (starting with d <= 50 kb, up to d <= 500 kb, in steps of 50 kb). This procedure allows multiple genes to be associated with a single variant, reflecting annotation uncertainty.

We represent the variant-mapping annotation with several plots showing the source of annotation of all variants, the number of genes associated with each variant, the distribution of the mapped genes across chromosomes, and a circular summary visualization showing the source of variant mapping, variant frequency and chromosomal distribution.
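The mapping priority described above (coding consequence, then eQTL, then position) can be sketched in Python as follows. This is a simplified illustration and not AnnotateMe's R code: the function names and data layout are hypothetical, LD proxies are left out, and it assumes the positional search stops at the first window that contains at least one gene.

```python
def distance_to_gene(pos, start, end):
    """Distance in bp from a position to a gene interval (0 if inside it)."""
    if start <= pos <= end:
        return 0
    return min(abs(pos - start), abs(pos - end))


def genes_by_distance(pos, genes, step=50_000, max_dist=500_000):
    """Widen the window in 50 kb steps; return (gene names, window size).

    `genes` is a list of (name, start, end) tuples on the variant's chromosome.
    Assumption: the search stops at the first window with at least one gene.
    """
    for d in range(step, max_dist + 1, step):
        hits = sorted(n for n, s, e in genes if distance_to_gene(pos, s, e) <= d)
        if hits:
            return hits, d
    return [], None


def map_variant(pos, coding_gene, eqtl_genes, nearby_genes):
    """Return (genes, annotation source) following the priority order."""
    if coding_gene:                      # coding consequence, e.g. from CADD
        return [coding_gene], "coding"
    if eqtl_genes:                       # eQTL associations, e.g. from GTEx
        return sorted(eqtl_genes), "eqtl"
    hits, _window = genes_by_distance(pos, nearby_genes)
    source = "position" if hits else "unmapped"
    return hits, source
```

Returning the annotation source alongside the genes makes it straightforward to tabulate how many variants were mapped by each source, as in the summary plots.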

Previous associations of the variants and genes

Given the list of input variants and their associated genes, we first look up these variants and genes in the GWAS catalog, a database that included, as of June 2020, 4493 publications and 179364 associations in total.[4] In this database, we search for previous associations of the variants with any trait. Similarly, we check whether the genes associated with these variants were previously reported to associate with any trait. However, allowing multiple genes to associate with each variant could result in an enrichment bias, as neighboring genes are often functionally related. To control for this, we implement a sampling procedure (1000 iterations): at each iteration, we (i) sample one gene from the pool of genes associated with each variant (thus allowing only a 1:1 relationship between variants and genes), and (ii) check whether the resulting genes were previously reported in the GWAS catalog. Averaging over the iterations, we obtain an unbiased estimate of the overlap of the genes associated with the input variants with each trait in the GWAS catalog. We represent overlaps with the GWAS catalog as barplots, where each bar represents a trait as found in the GWAS catalog: this is done both at the level of single variants and at the level of genes.
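The sampling procedure above can be illustrated with a short Python sketch. It is not the tool's R implementation: the function name, the dictionary-based inputs and the fixed random seed are assumptions, and a gene drawn for two different variants in the same iteration is counted once here.

```python
import random
from collections import Counter


def sampled_trait_overlap(variant_genes, gene_traits, n_iter=1000, seed=0):
    """Average number of sampled genes per iteration reported for each trait.

    variant_genes: dict mapping variant -> list of candidate genes
    gene_traits:   dict mapping gene -> set of traits (e.g. from the GWAS catalog)
    """
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(n_iter):
        # (i) one gene per variant, enforcing a 1:1 variant-gene relationship
        sampled = {rng.choice(genes) for genes in variant_genes.values() if genes}
        # (ii) count the traits previously reported for the sampled genes
        for gene in sampled:
            for trait in gene_traits.get(gene, ()):
                totals[trait] += 1
    return {trait: count / n_iter for trait, count in totals.items()}
```

For example, a variant mapped to two neighboring genes that share a trait contributes one count per iteration for that trait rather than two, which is the bias the sampling is meant to remove.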

Gene-set enrichment analysis

The final step in the functional annotation procedure is the gene-set overlap analysis, which explores the biological processes enriched in the list of genes associated with the input variants. Once again, to avoid enrichment bias due to multiple genes mapping to the same variant, we use a sampling procedure: at each iteration, we (i) sample one gene from the pool of genes associated with each variant and (ii) perform gene-set overlap analysis with the resulting list of genes. Gene-set overlap analysis is performed with the gost function (g:GOSt) as implemented in the R package gprofiler2, with Biological Processes (GO:BP) as background, excluding electronic annotations and correcting p-values using FDR. Finally, we average the p-values of each enriched term over the iterations (N=1000). To reduce the complexity of the resulting enriched biological processes, we use the web-server tool REVIGO.[5] This tool summarizes enrichment results by removing redundant terms based on a semantic similarity measure, and displays the remaining terms in an embedded space via eigenvalue decomposition of the pairwise distance matrix. We chose Lin as the semantic distance measure and allowed a small similarity among terms for them to be clustered together.
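To make the correction and averaging steps concrete, here is a minimal Python sketch. It does not call gprofiler2 and is not the tool's code: it assumes that FDR refers to the Benjamini-Hochberg procedure, and it treats a term that is absent from an iteration as having p = 1.0, which is an assumption about how missing terms are handled.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def average_pvalues(iterations):
    """Average p-values per term across iterations.

    iterations: list of dicts mapping term -> (adjusted) p-value.
    Assumption: a term missing from an iteration counts as p = 1.0.
    """
    terms = set().union(*iterations)
    n = len(iterations)
    return {t: sum(it.get(t, 1.0) for it in iterations) / n for t in terms}
```

In this sketch, each of the 1000 sampling iterations would produce one dictionary of term p-values, and `average_pvalues` would combine them into the final per-term value.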

Packages and tools required to run AnnotateMe locally

AnnotateMe is written in R and therefore requires R to be correctly installed and running on your system. You can download and install R from https://www.r-project.org. The packages required for running a full AnnotateMe instance are: data.table, stringr, parallel, lme4, ggsci, RColorBrewer, gprofiler2, rvest, GOSemSim, GO.db, org.Hs.eg.db, pheatmap, circlize, plotrix, ggplot2, devtools, treemap, basicPlotteR, gwascat, GenomicRanges, rtracklayer, Homo.sapiens, BiocGenerics and liftOver. In addition, AnnotateMe requires PLINK to be installed and executable on your system: see https://www.cog-genomics.org/plink2/. Finally, several bash commands are used, including cp, cat, rm and sendEmail.
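Before a local run, it can help to confirm that the external executables listed above are on the PATH. The Python sketch below does only that; it does not check R packages. The executable names in `REQUIRED_TOOLS` are assumptions (for instance, PLINK may be installed as `plink` or `plink2` depending on the version).

```python
import shutil

# Assumed executable names; adjust to match your installation.
REQUIRED_TOOLS = ["Rscript", "plink", "cp", "cat", "rm", "sendEmail"]


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All required executables were found.")
```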

References

[1] Rentzsch,P. et al. (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res., 47, D886–D894.

[2] GTEx Consortium (2013) The Genotype-Tissue Expression (GTEx) project. Nat. Genet., 45, 580–585.

[3] O’Leary,N.A. et al. (2016) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733–745.

[4] Buniello,A. et al. (2019) The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res., 47, D1005–D1012.

[5] Supek,F. et al. (2011) REVIGO Summarizes and Visualizes Long Lists of Gene Ontology Terms. PLoS ONE, 6, e21800.
