targetage-pipeline

This repo contains the data and analysis to produce the TargetAge set of 995 targets with shared genetic links to age-related diseases and traits described in West et al., Integrative analysis of GWAS and co-localisation data suggests novel genes associated with age-related multimorbidity, medRxiv 2022.11.11.22282236 (https://doi.org/10.1101/2022.11.11.22282236).

The data can be explored in our TargetAge web application, and the code for it is in a separate repo.

The data_deposit directory contains a simplified version of the input and output data:

full_disease_list.csv - EFO codes for all age-related diseases and traits used to retrieve GWAS studies
GWAS_accession_codes.csv - GWAS accession codes for all GWAS studies used in the analysis
genetics_table.Rda - genetic data: details of each associated locus, lead variant, genomic positions, significance and effect size, L2G score for the most likely causal gene, and corresponding cluster information.

This analysis of age-related disease targets uses Open Targets Platform and Open Targets Genetics.

Open Targets Platform [1] aggregates data from a large range of sources, using "evidence" to make connections between "target" entities (a protein, protein complex, or RNA molecule) and "disease" entities (EFO terms). The evidence linking a target and disease is summarised as a target-disease association, with a score ranging from 0-1 to assist with prioritisation. Targets, diseases, and evidence are comprehensively annotated.

Open Targets Genetics [2] aggregates human GWAS and functional genetics data in a variant-centric manner to enable large-scale exploration and prioritisation of potential causal variants and genes. It provides a disease-agnostic Variant to Gene (V2G) mapping and a disease-specific Locus to Gene (L2G) mapping for trait-associated loci, with both using a score ranging from 0-1. It also enables systematic comparison between studies, providing the number of shared independently-associated loci between studies, and performing pairwise colocalisation analysis.

Open Targets Genetics expands each lead variant (the variant with the most significant p-value) to include tag variants, representing a more complete set of potentially causal variants at a trait-associated locus. Where summary statistics are available, this expansion is performed using fine-mapping and credible-set analysis. Where only the lead variant is reported per locus, expansion is performed using Linkage Disequilibrium (LD) analysis, with the 1000 Genomes Phase 3 (1KG) haplotype panel as a reference population.

Step 1: spark_targetage.py

This script:

Reads in the data from Open Targets Platform, Open Targets Genetics, and the ARD list
Expands the ARDs to include all ontological descendant terms, removing excluded therapeutic areas
Gets all GWAS evidence and target annotations for ARDs
Gets all lead variants, summary statistics, and credible sets from these GWAS
Generates edges between GWAS hits with potentially shared causal variants
Saves output files in data/targetage

Data

disease_list.csv - A curated list of age-related diseases (ARDs), and the corresponding EFO-codes. Commented lines are ignored.
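As an illustration of how commented lines in the curated list could be skipped, here is a minimal sketch using only the Python standard library. The `#` comment marker, the column names, and the placeholder codes are assumptions made for the example; they are not taken from spark_targetage.py.

```python
import csv
import io


def read_disease_list(handle, comment_prefix="#"):
    """Return CSV rows as dicts, skipping blank and commented lines.

    The comment prefix is an assumption; adjust it to match the real file.
    """
    kept = (
        line
        for line in handle
        if line.strip() and not line.lstrip().startswith(comment_prefix)
    )
    return list(csv.DictReader(kept))


# Toy input with hypothetical column names and placeholder codes.
example = io.StringIO(
    "diseaseName,diseaseId\n"
    "osteoarthritis,EFO_PLACEHOLDER_1\n"
    "# excluded disease,EFO_PLACEHOLDER_2\n"
    "cataract,EFO_PLACEHOLDER_3\n"
)
ards = read_disease_list(example)
print([row["diseaseName"] for row in ards])
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `StringIO` object.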

OT_platform (parquet format)

diseases - annotation information for disease entities
targets - annotation information for target entities
evidences - all evidences used to make associations between targets and diseases
evidences/succeeded/sourceId\=ot_genetics_portal/ - evidence taken from the genetics portal (human genome-wide association data)
knownDrugs - evidence from drugs with a known mechanism of action (MOA) and indication, linking a disease to a target

OT_genetics (json format)

v2d_coloc - colocalisation analysis result for each variant+studyId (GWAS-GWAS and GWAS-xQTL)

Analysis

Expansion to specific diseases

First, get the disease annotations for the ARDs (ardiseases), including the EFO label (diseaseName), Therapeutic Areas, and description.

Open Targets (optionally) propagates evidence from specific, lower-level EFO terms, to support associations for more general, higher-level EFO terms. For example, evidence to support associations for osteoarthritis will include evidence linked to its child term, osteoarthritis, knee.

Get EFO terms and names (specificDiseaseId and specificDiseaseName) for each ARD and its descendant terms. We will use this to retrieve the evidence supporting the associations, but we will also need this information later to decide whether the evidence is relevant for our purposes.
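The expansion from each ARD to its descendant terms can be sketched as a breadth-first walk over the ontology. This is a minimal illustration in plain Python, not the Spark code used by the script; the `children` mapping and the term names in it are toy inputs assumed for the example.

```python
from collections import deque


def expand_descendants(ard_ids, children):
    """Map each ARD to the set containing itself and all descendant terms.

    `children` is a hypothetical dict from a term to its direct child terms.
    """
    expanded = {}
    for ard in ard_ids:
        seen = {ard}
        queue = deque([ard])
        while queue:
            term = queue.popleft()
            for child in children.get(term, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        expanded[ard] = seen
    return expanded


# Toy ontology fragment (term names used in place of EFO codes).
children = {
    "osteoarthritis": ["osteoarthritis, knee", "osteoarthritis, hip"],
    "osteoarthritis, knee": [],
}
specific = expand_descendants(["osteoarthritis"], children)
print(sorted(specific["osteoarthritis"]))
```

Keeping the ARD alongside each specific term, as the mapping above does, is what later allows evidence for a child term to be traced back to the ARD it supports.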

Get target-disease associations and associated target annotations

overall_associations is all target-disease associations for our ARDs, considering all data sources (including genetic, literature, drugs, and other types of evidence).

gen_associations is all target-disease associations based on genetic evidence.

ard_targets is all the target annotations for targets linked to the ARDs through genetic evidence.

Get Open Targets Genetics evidence

otg_evidences is all the evidence from Open Targets Genetics (i.e. human genome-wide association evidence)

ard_otg_evidences is all the Open Targets Genetics evidence for our ARDs. Each "evidence" is a genome-wide significant trait-associated locus mapped to a target (gene) with an L2G score of at least 0.05. Note that a single trait-associated locus may be mapped to more than one target, so an individual GWAS association may be represented more than once as evidence for different targets.
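The point about one locus appearing as several pieces of evidence can be seen in a small sketch. The field names (`studyId`, `variantId`, `targetId`, `l2g`) and the rows are hypothetical; only the 0.05 threshold comes from the text above.

```python
L2G_MIN = 0.05  # threshold stated in the text


def filter_evidence(rows, l2g_min=L2G_MIN):
    """Keep locus-to-target rows whose L2G score is at least the threshold."""
    return [row for row in rows if row["l2g"] >= l2g_min]


# One toy locus mapped to three candidate targets.
rows = [
    {"studyId": "S1", "variantId": "1_100_A_G", "targetId": "GENE_A", "l2g": 0.72},
    {"studyId": "S1", "variantId": "1_100_A_G", "targetId": "GENE_B", "l2g": 0.11},
    {"studyId": "S1", "variantId": "1_100_A_G", "targetId": "GENE_C", "l2g": 0.01},
]
evidence = filter_evidence(rows)

# Distinct loci are identified by study and lead variant, not by row count.
loci = {(row["studyId"], row["variantId"]) for row in evidence}
print(len(evidence), "evidence rows from", len(loci), "locus")
```

Counting rows would overstate the number of associations here, so any per-locus summary should deduplicate on the study and variant first.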

ard_studies is all the unique GWAS studies (studyId) that we are interested in for our ARDs, the trait studied (trait_reported), the ARD (diseaseName) and specific disease (specificDiseaseName) to which this trait was mapped, and whether the study has summary statistics available (has_sum_stats).

Establishing independent genetic signals

We are interested in the amount of genetic overlap between different ARDs.

In order to determine how many genetic signals are shared between ARDs, we have to establish the number of independent trait-associated signals for each ARD (i.e. the overlap between GWAS studies for an ARD).

As we don't know the causal variant for each trait-associated locus, we have to use approximations to decide whether the associations from two different GWAS studies represent the same signal. To do this, we combine two approaches from Open Targets Genetics:

Colocalisation

Where study 1 and study 2 both have summary statistics available, they can be compared using colocalisation analysis. For each locus, this method integrates over evidence from all variants in each study to evaluate which of these five hypotheses is most likely: no association with either trait (H0), association only with trait 1 (H1), association only with trait 2 (H2), association with both traits via two independent SNPs (H3), or association with both traits through a shared causal SNP (H4). We use a cut-off of H4 > 0.8 to define the associations for study 1 and study 2 as sharing the same causal variant. Note that this methodology is used to compare two disease traits, or to compare a disease study with a molecular trait (e.g. eQTL or pQTL).
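A minimal sketch of turning colocalisation results into graph edges with the H4 cut-off follows. The row layout (`left`, `right`, `h4`) is a hypothetical simplification of the colocalisation table, and the posterior values are invented for the example.

```python
H4_CUTOFF = 0.8  # cut-off stated in the text


def coloc_edges(rows, cutoff=H4_CUTOFF):
    """Return undirected edges between GWAS hits whose H4 exceeds the cut-off."""
    edges = set()
    for row in rows:
        if row["h4"] > cutoff and row["left"] != row["right"]:
            # frozenset makes (a, b) and (b, a) the same edge.
            edges.add(frozenset((row["left"], row["right"])))
    return edges


# Toy results: each hit is labelled "study:lead variant".
rows = [
    {"left": "S1:1_100_A_G", "right": "S2:1_150_C_T", "h4": 0.93},
    {"left": "S2:1_150_C_T", "right": "S1:1_100_A_G", "h4": 0.93},
    {"left": "S1:1_100_A_G", "right": "S3:1_900_G_A", "h4": 0.40},
]
edges = coloc_edges(rows)
print(len(edges))
```

Storing each edge as an unordered pair avoids double counting when the same comparison is reported in both directions.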

Tag variant overlap

Where only the lead variants (not full summary statistics) are available for one or both studies, colocalisation analysis cannot be performed, and tag variant overlap is used instead. A trait-associated lead variant, SNP1, is considered to be part of the same signal as another lead variant, SNP2, if they are within 5 Mb of each other and share any of their LD-defined tag variants.
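The tag variant overlap rule can be written as a small predicate. This is a sketch under stated assumptions: the record layout (`chrom`, `pos`, `tags`) is hypothetical, the distance window is taken as 5,000,000 base pairs inclusive, and two leads on different chromosomes are assumed never to match.

```python
MAX_DISTANCE_BP = 5_000_000  # distance window from the text, in base pairs


def same_signal(lead1, lead2, max_distance=MAX_DISTANCE_BP):
    """Return True if two lead variants are close and share a tag variant."""
    if lead1["chrom"] != lead2["chrom"]:
        return False
    if abs(lead1["pos"] - lead2["pos"]) > max_distance:
        return False
    # Any shared tag variant is enough to link the two leads.
    return not lead1["tags"].isdisjoint(lead2["tags"])


# Toy lead variants with their LD-defined tag sets.
snp1 = {"chrom": "1", "pos": 1_000_000, "tags": {"rsA", "rsB"}}
snp2 = {"chrom": "1", "pos": 1_200_000, "tags": {"rsB", "rsC"}}
snp3 = {"chrom": "1", "pos": 9_000_000, "tags": {"rsB"}}

print(same_signal(snp1, snp2), same_signal(snp1, snp3))
```

In the example, `snp3` shares a tag with `snp1` but lies outside the distance window, so only the first pair is treated as one signal.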

Output files

Save output

Output files saved from this analysis:

full_disease_list.csv - all ARDs including ontology descendants
targetage/ard_v2d.parquet - V2D
targetage/ard_associations.parquet - OT Platform association information for all ARD genes
targetage/ard_annotations.parquet - OT Platform annotations for all ARD genes
targetage/coloc_ard_leads.parquet - edges with colocalisation evidence
targetage/overlap_ard_leads.parquet - edges with overlapping tag variants
targetage/ard_leads.parquet - all lead variants from ARD GWAS

Step 2: targetage_analysis.R

Functions for this script are in targetage_analysis_functions.R

NOTE: Use default_overwrite=TRUE to read in preprepared data, rather than regenerating and overwriting data.

This script:

Reads and processes the output of spark_targetage.py (enforcing minimum GWAS study size)
Performs a graph analysis:
   a. Combines colocalisation and tag variant overlap edges to create a graph of each ARD individually and a graph of all ARDs
   b. Calculates the pairwise overlaps between ARDs and generates a heatmap
   c. Gets the most likely causal gene for each cluster via OT Genetics L2G
   d. Gets the genes associated with multimorbidity clusters (TargetAge Gene Set)
Summarises GWAS used, including number of studies and ancestry information
Processes annotations and tractability information:
   a. Overlap with GenAge, CellAge, and Ageing Hallmarks
   b. Enrichment for Ageing mechanisms
   c. Tractability assessment
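The graph step can be illustrated with a small Python sketch, although the script itself is written in R. The sketch groups GWAS hits into clusters using connected components, which is a simplification assumed here and may differ from the clustering used in targetage_analysis.R. It then flags clusters that span more than one ARD and counts shared clusters per ARD pair, the kind of table a heatmap could be drawn from. All node names and disease labels are toy values.

```python
from collections import Counter, defaultdict
from itertools import combinations


def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in nodes:
        groups[find(node)].add(node)
    return list(groups.values())


# Toy graph: each node is a GWAS hit labelled with its ARD.
node_ard = {
    "hit1": "osteoarthritis",
    "hit2": "osteoarthritis",
    "hit3": "cataract",
    "hit4": "cataract",
}
coloc = [("hit1", "hit2")]    # edges with colocalisation evidence
overlap = [("hit2", "hit3")]  # edges with overlapping tag variants

clusters = connected_components(node_ard, coloc + overlap)

# Clusters containing hits from more than one ARD.
multimorbid = [c for c in clusters if len({node_ard[n] for n in c}) > 1]

# Number of clusters shared by each pair of ARDs.
pair_counts = Counter()
for cluster in clusters:
    ards_in_cluster = sorted({node_ard[n] for n in cluster})
    pair_counts.update(combinations(ards_in_cluster, 2))

print(multimorbid, dict(pair_counts))
```

Running the same function on the hits of a single ARD gives the number of independent signals for that disease under this simplified definition.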

Intermediary files

analysis/ard_leads_filtered.Rda - ARD lead variants (filtered for GWAS size)
analysis/overlaps_within.Rda - overlaps between ARD GWAS hits
analysis/overlaps_within.Rda - overlaps with non-ARD traits
analysis/qtl_coloc - ARD hit colocalisations with xQTL studies
analysis/graph_all_morbidities.Rda - graph with all ARDs
analysis/individual_disease_graphs.Rda - graphs for each ARD individually
analysis/individual_disease_n_communities.Rda - number of communities in clusters in individual ARD graphs
analysis/ltg_all.Rda - L2G mappings for all nodes (lead variants) in the graph
analysis/l2g_qtl_coloc.Rda - L2G mappings for all xQTLs that colocalise with ARD hits
analysis/top_l2g.Rda - top gene by L2G per lead variant
analysis/targetage_geneids.Rda - ENSEMBL IDs of TargetAge genes
analysis/targetage_annotations.Rda - Open Targets annotations for all TargetAge genes
analysis/diseases_with_associations.Rda - list of diseases with GWAS associations
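Selecting the top gene by L2G per lead variant, as in the top_l2g file listed above, amounts to a per-group maximum. The following Python sketch shows the idea on toy rows; the field names are hypothetical, and ties are resolved by keeping the first row seen, which is an assumption rather than the behaviour of the R script.

```python
def top_gene_per_variant(rows):
    """Return the gene with the highest L2G score for each lead variant."""
    best = {}
    for row in rows:
        current = best.get(row["variantId"])
        # Strict comparison keeps the first row when scores tie.
        if current is None or row["l2g"] > current["l2g"]:
            best[row["variantId"]] = row
    return {variant: row["geneId"] for variant, row in best.items()}


# Toy L2G mappings for two lead variants.
rows = [
    {"variantId": "1_100_A_G", "geneId": "GENE_A", "l2g": 0.72},
    {"variantId": "1_100_A_G", "geneId": "GENE_B", "l2g": 0.11},
    {"variantId": "2_500_C_T", "geneId": "GENE_C", "l2g": 0.30},
    {"variantId": "2_500_C_T", "geneId": "GENE_D", "l2g": 0.55},
]
top = top_gene_per_variant(rows)
print(top)
```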

Step 3: prepare_app_data.R

This script generates the data required for the TargetAge application.

Data

data/analysis/target_annotations.Rda - TargetAge annotations generated by the targetage_analysis.R script.
ltg_all.Rda
data/analysis/graph_all_morbidities.Rda
data/analysis/ard_leads_filtered.Rda
data/analysis/top_l2g.Rda

[1] Ochoa et al., Open Targets Platform: supporting systematic drug–target identification and prioritisation, Nucleic Acids Research (2021) https://doi.org/10.1093/nar/gkaa1027

[2] Ghoussaini et al., Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics, Nucleic Acids Research (2021) https://doi.org/10.1093/nar/gkaa840
