This repo contains the data and analysis to produce the TargetAge set of 995 targets with shared genetic links to age-related diseases and traits described in West et al., Integrative analysis of GWAS and co-localisation data suggests novel genes associated with age-related multimorbidity, medRxiv 2022.11.11.22282236 (https://doi.org/10.1101/2022.11.11.22282236).
The data can be explored in our TargetAge web application, and the code for the application is in a separate repo.
The data_deposit directory contains a simplified version of the input and output data:
- full_disease_list.csv - EFO codes for all age-related diseases and traits used to retrieve GWAS studies
- GWAS_accession_codes.csv - GWAS accession codes for all GWAS studies used in the analysis
- genetics_table.Rda - genetic data: details of each associated locus, lead variant, genomic positions, significance and effect size, L2G score for the most likely causal gene, and corresponding cluster information.
This analysis of age-related disease targets uses Open Targets Platform and Open Targets Genetics.
Open Targets Platform [1] aggregates data from a large range of sources, using "evidence" to make connections between "target" entities (a protein, protein complex, or RNA molecule) and "disease" entities (EFO terms). The evidence linking a target and disease is summarised as a target-disease association, with a score ranging from 0-1 to assist with prioritisation. Targets, diseases, and evidence are comprehensively annotated.
Open Targets Genetics [2] aggregates human GWAS and functional genetics data in a variant-centric manner to enable large-scale exploration and prioritisation of potential causal variants and genes. It provides a disease-agnostic Variant to Gene (V2G) mapping and a disease-specific Locus to Gene (L2G) mapping for trait-associated loci, with both using a score ranging from 0-1. It also enables systematic comparison between studies, providing the number of shared independently-associated loci between studies, and performing pairwise colocalisation analysis.
Open Targets Genetics expands each lead variant (the variant with the most significant p-value) to include tag variants, representing a more complete set of potentially causal variants at a trait-associated locus. Where summary statistics are available, this expansion is performed using fine-mapping and credible-set analysis. Where only the lead variant is reported per locus, expansion is performed using Linkage Disequilibrium (LD) analysis, with the 1000 Genomes Phase 3 (1KG) haplotype panel as a reference population.
This script:
- Reads in the data from Open Targets Platform, Open Targets Genetics, and the ARD list
- Expands the ARDs to include all ontological descendant terms, removing excluded therapeutic areas
- Gets all GWAS evidence and target annotations for ARDs
- Gets all lead variants, summary statistics, and credible sets from these GWAS
- Generates edges between GWAS hits with potentially shared causal variants
- Saves output files in data/targetage
- disease_list.csv - a curated list of age-related diseases (ARDs) and the corresponding EFO codes. Commented lines are ignored.
- diseases - annotation information for disease entities
- targets - annotation information for target entities
- evidences - all evidence used to make associations between targets and diseases
- evidences/succeeded/sourceId\=ot_genetics_portal/ - evidence taken from the genetics portal (human genome-wide association data)
- knownDrugs - evidence from drugs with a known mechanism of action (MOA) and indication that links a disease to a target
- v2d_coloc - colocalisation analysis results for each variant+studyId (GWAS-GWAS and GWAS-xQTL)
First, get the disease annotations for the ARDs (ardiseases), including the EFO label (diseaseName), Therapeutic Areas, and description.
Open Targets (optionally) propagates evidence from specific, lower-level EFO terms to support associations for more general, higher-level EFO terms. For example, evidence to support associations for "osteoarthritis" will include evidence linked to its child term, "osteoarthritis, knee".
Get EFO terms and names (specificDiseaseId and specificDiseaseName) for each ARD and its descendant terms. We will use this to retrieve the evidence supporting the associations, but we will also need this information later to decide whether the evidence is relevant for our purposes.
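The descendant expansion described above can be sketched in plain Python. This is only an illustration, not the repository's Spark code: the `CHILDREN` mapping, the placeholder grandchild term, and both function names are assumptions made for the example, and disease labels stand in for EFO identifiers.

```python
from collections import deque

# Hypothetical toy hierarchy: term -> direct child terms. Only the
# "osteoarthritis" / "osteoarthritis, knee" pair comes from the text above;
# the grandchild is a placeholder.
CHILDREN = {
    "osteoarthritis": ["osteoarthritis, knee"],
    "osteoarthritis, knee": ["placeholder grandchild term"],
}


def expand_with_descendants(term, children):
    """Return the term followed by all of its descendants (breadth-first)."""
    ordered = [term]
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                ordered.append(child)
                queue.append(child)
    return ordered


def specific_disease_rows(ards, children):
    """Pair each ARD with itself and each of its descendant terms."""
    return [
        {"diseaseName": ard, "specificDiseaseName": specific}
        for ard in ards
        for specific in expand_with_descendants(ard, children)
    ]


rows = specific_disease_rows(["osteoarthritis"], CHILDREN)
for row in rows:
    print(row)
```

Keeping both the ARD and the specific term on every row is what later allows evidence attached to a descendant term to be traced back to the ARD it supports.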
overall_associations is all target-disease associations for our ARDs, considering all data sources (including genetic, literature, drugs, and other types of evidence).
gen_associations is all target-disease associations based on genetic evidence.
ard_targets is all the target annotations for targets linked to the ARDs through genetic evidence.
otg_evidences is all the evidence from Open Targets Genetics (i.e. human genome-wide association evidence).
ard_otg_evidences is all the Open Targets Genetics evidence for our ARDs. Each "evidence" is a genome-wide significant trait-associated locus mapped to a target (gene) with an L2G score of at least 0.05. Note that a single trait-associated locus may be mapped to more than one target, so an individual GWAS association may be represented more than once as evidence for different targets.
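To make the L2G threshold and the one-locus-to-many-targets point concrete, here is a minimal sketch. The rows, the field names (`locus`, `targetId`, `l2g`), and the helper functions are hypothetical and do not reflect the actual Open Targets evidence schema; only the 0.05 cut-off comes from the text.

```python
from collections import defaultdict

L2G_THRESHOLD = 0.05  # minimum L2G score quoted in the text

# Hypothetical evidence rows: one row per (locus, target) mapping.
evidences = [
    {"studyId": "STUDY_A", "locus": "locus_1", "targetId": "GENE_1", "l2g": 0.62},
    {"studyId": "STUDY_A", "locus": "locus_1", "targetId": "GENE_2", "l2g": 0.11},
    {"studyId": "STUDY_A", "locus": "locus_1", "targetId": "GENE_3", "l2g": 0.02},
    {"studyId": "STUDY_B", "locus": "locus_2", "targetId": "GENE_4", "l2g": 0.30},
]


def filter_by_l2g(rows, threshold=L2G_THRESHOLD):
    """Keep only mappings with an L2G score of at least the threshold."""
    return [row for row in rows if row["l2g"] >= threshold]


def targets_per_locus(rows):
    """Group target IDs by (study, locus) to show repeated loci."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["studyId"], row["locus"])].append(row["targetId"])
    return dict(grouped)


kept = filter_by_l2g(evidences)
by_locus = targets_per_locus(kept)
print(by_locus)
```

In this toy data the single association at `locus_1` still appears twice after filtering, once for each target that passes the threshold, which is why counts of evidence rows should not be read as counts of independent GWAS associations.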
ard_studies is all the unique GWAS studies (studyId) that we are interested in for our ARDs, the trait studied (trait_reported), the ARD (diseaseName) and specific disease (specificDiseaseName) to which this trait was mapped, and whether the study has summary statistics available (has_sum_stats).
We are interested in the amount of genetic overlap between different ARDs.
In order to determine how many genetic signals are shared between ARDs, we have to establish the number of independent trait-associated signals for each ARD (i.e. the overlap between GWAS studies for an ARD).
As we don't know the causal variant for each trait-associated locus, we have to use approximations to decide whether the associations from two different GWAS studies represent the same signal. To do this, we combine two approaches from Open Targets Genetics:
- Colocalisation
Where study 1 and study 2 both have summary statistics available, they can be compared using colocalisation analysis. For each locus, this method integrates over evidence from all variants in each study to evaluate which of five hypotheses is most likely: no association with either trait (H0), association only with trait 1 (H1), association only with trait 2 (H2), association with both traits via two independent SNPs (H3), or association with both traits through a shared causal SNP (H4). We use a cut-off of H4 > 0.8 to define the associations for study 1 and study 2 as sharing the same causal variant. Note that this methodology is used to compare two disease traits, or to compare a disease study with a molecular trait (e.g. eQTL or pQTL).
- Tag variant overlap
Where only the lead variants (not full summary statistics) are available for one or both studies, colocalisation analysis cannot be performed, and tag variant overlap is used instead. A trait-associated lead variant, SNP1, is considered to be part of the same signal as another lead variant, SNP2, if they are within 5 Mb of each other and share any of their LD-defined tag variants.
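The two edge rules above can be sketched as follows. This is an illustration under stated assumptions rather than the logic in spark_targetage.py: the record layouts, identifiers, and function names are hypothetical, the requirement that both variants sit on the same chromosome is an assumption, and "within 5 Mb" is read here as a distance of at most 5,000,000 bp.

```python
from itertools import combinations

H4_CUTOFF = 0.8              # colocalisation cut-off quoted in the text
MAX_DISTANCE_BP = 5_000_000  # assumed reading of "within 5 Mb"


def coloc_edges(coloc_rows, cutoff=H4_CUTOFF):
    """Keep study-locus pairs whose H4 value exceeds the cut-off."""
    return [(r["left"], r["right"]) for r in coloc_rows if r["h4"] > cutoff]


def tag_overlap_edges(leads, max_distance=MAX_DISTANCE_BP):
    """Link lead variants that are close together and share a tag variant."""
    edges = []
    for a, b in combinations(leads, 2):
        same_chrom = a["chrom"] == b["chrom"]  # assumption: same chromosome
        close = abs(a["pos"] - b["pos"]) <= max_distance
        if same_chrom and close and a["tags"] & b["tags"]:
            edges.append((a["id"], b["id"]))
    return edges


# Hypothetical inputs.
coloc_rows = [
    {"left": "S1:locus1", "right": "S2:locus1", "h4": 0.93},
    {"left": "S1:locus2", "right": "S3:locus1", "h4": 0.41},
]
leads = [
    {"id": "S4:snp1", "chrom": "1", "pos": 1_000_000, "tags": {"rsA", "rsB"}},
    {"id": "S5:snp2", "chrom": "1", "pos": 1_200_000, "tags": {"rsB", "rsC"}},
    {"id": "S6:snp3", "chrom": "1", "pos": 9_000_000, "tags": {"rsB"}},
]

edges_coloc = coloc_edges(coloc_rows)
edges_overlap = tag_overlap_edges(leads)
print(edges_coloc)
print(edges_overlap)
```

In the toy data, `S6:snp3` shares a tag variant with the other two leads but lies more than 5 Mb away from both, so it is not linked to either.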
Output files saved from this analysis:
- full_disease_list.csv - all ARDs including ontology descendants
- targetage/ard_v2d.parquet - V2D
- targetage/ard_associations.parquet - OT Platform association information for all ARD genes
- targetage/ard_annotations.parquet - OT Platform annotations for all ARD genes
- targetage/coloc_ard_leads.parquet - edges with colocalisation evidence
- targetage/overlap_ard_leads.parquet - edges with overlapping tag variants
- targetage/ard_leads.parquet - all lead variants from ARD GWAS
Functions for this script are in targetage_analysis_functions.R.
NOTE: Use default_overwrite=TRUE to read in preprepared data, rather than regenerating and overwriting data.
This script:
- Reads and processes the output of spark_targetage.py (enforcing minimum GWAS study size)
- Performs a graph analysis:
  a. Combines colocalisation and tag variant overlap edges to create a graph of each ARD individually and a graph of all ARDs
  b. Calculates the pairwise overlaps between ARDs and generates a heatmap
  c. Gets the most likely causal gene for each cluster via OT Genetics L2G
  d. Gets the genes associated with multimorbidity clusters (TargetAge Gene Set)
- Summarises GWAS used, including number of studies and ancestry information
- Processes annotations and tractability information:
  a. Overlap with GenAge, CellAge, and Ageing Hallmarks
  b. Enrichment for ageing mechanisms
  c. Tractability assessment
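The graph steps above (combining both edge types, grouping lead variants into clusters, and counting clusters shared between ARDs) can be sketched in Python. The analysis script itself is written in R, and its clustering may use community detection; the sketch below substitutes plain connected components via union-find as a simplification. All node names, ARD labels, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def pairwise_ard_overlap(clusters, node_to_ard):
    """Count, for each pair of ARDs, the clusters containing both."""
    counts = defaultdict(int)
    for cluster in clusters:
        ards = sorted({node_to_ard[n] for n in cluster})
        for pair in combinations(ards, 2):
            counts[pair] += 1
    return dict(counts)


# Hypothetical lead variants (nodes) and the ARD each one belongs to.
node_to_ard = {"n1": "ARD_A", "n2": "ARD_B", "n3": "ARD_A", "n4": "ARD_C"}
coloc = [("n1", "n2")]    # edges with colocalisation evidence
overlap = [("n2", "n3")]  # edges with overlapping tag variants

clusters = connected_components(list(node_to_ard), coloc + overlap)
shared = pairwise_ard_overlap(clusters, node_to_ard)
multimorbidity = [c for c in clusters if len({node_to_ard[n] for n in c}) > 1]
print(shared)
print(multimorbidity)
```

Running the same component search on the nodes of a single ARD gives an approximate count of independent signals for that ARD, while clusters spanning more than one ARD are the candidates for shared genetic signals.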
- analysis/ard_leads_filtered.Rda - ARD lead variants (filtered for GWAS size)
- analysis/overlaps_within.Rda - overlaps between ARD GWAS hits
- analysis/overlaps_within.Rda - overlaps with non-ARD traits
- analysis/qtl_coloc - ARD hit colocalisations with xQTL studies
- analysis/graph_all_morbidities.Rda - graph with all ARDs
- analysis/individual_disease_graphs.Rda - graphs for each ARD individually
- analysis/individual_disease_n_communities.Rda - number of communities in clusters in individual ARD graphs
- analysis/ltg_all.Rda - L2G mappings for all nodes (lead variants) in the graph
- analysis/l2g_qtl_coloc.Rda - L2G mappings for all xQTLs that colocalise with ARD hits
- analysis/top_l2g.Rda - top gene by L2G per lead variant
- analysis/targetage_geneids.Rda - ENSEMBL IDs of TargetAge genes
- analysis/targetage_annotations.Rda - Open Targets annotations for all TargetAge genes
- analysis/diseases_with_associations.Rda - list of diseases with GWAS associations
This script generates the data required for the TargetAge application.
- data/analysis/target_annotations.Rda - TargetAge annotations generated by the targetage_analysis.R script.
- ltg_all.Rda
- data/analysis/graph_all_morbidities.Rda
- data/analysis/ard_leads_filtered.Rda
- data/analysis/top_l2g.Rda
[1] Ochoa et al., Open Targets Platform: supporting systematic drug–target identification and prioritisation, Nucleic Acids Research (2021) https://doi.org/10.1093/nar/gkaa1027
[2] Ghoussaini et al., Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics, Nucleic Acids Research (2021) https://doi.org/10.1093/nar/gkaa840