-
Notifications
You must be signed in to change notification settings - Fork 0
01 Worked Example
This walkthrough uses a limma comparison of FLT3-IR (drug-resistant) vs FLT3-overexpressing K562 cells (GSE226360, ~12,000 genes) to demonstrate a complete DSGE analysis.
library(DSGE)
library(org.Hs.eg.db)DSGE accepts p-values from any differential expression tool — DESeq2, limma, edgeR, Seurat, etc. The only required column is pvalue.
The package ships with example data in inst/data_exp/. Let's load it:
res <- read.csv("inst/data_exp/limma_FLT3_IR_vs_FLT3.csv",
stringsAsFactors = FALSE)
dim(res)
# [1] 12191 6
colnames(res)
# [1] "gene" "log2FoldChange" "AveExpr" "t"
# [5] "pvalue" "padj"We have 6 columns, about 12,000 genes. The columns we need:
-
pvalue— nominal p-values (required by DSGE) -
AveExpr— average expression level (optional, used for low-expression filtering) -
log2FoldChange— effect size (optional, used for directional NDS analysis)
DSGE needs to know which genes belong to which GO pathways. The function get_pathway_genes_db() builds this mapping from a Bioconductor OrgDb, which already contains curated GO annotations.
pw <- get_pathway_genes_db(
org.Hs.eg.db,
min_size = 5 # drop GO terms with fewer than 5 genes
)
length(pw)
# [1] 13001 (number of GO terms with >= 5 genes)The output pw is a named list — each name is a GO term ID, and each element contains the genes in that pathway:
str(pw[[1]])
# 'data.frame': 21 obs. of 4 variables:
# $ db_object_id : chr "266" "84720" ...
# $ db_object_symbol : chr "AMBRA1" "MYO19" ...
# $ gs_name : chr "mitochondrion inheritance"
# $ gs_source : chr "BP"min_size — pathways smaller than this are excluded. Below 5, the DSGE null distribution becomes very noisy; above 500, pathways are too broad to be informative.
aspect — filter by ontology. "BP" (Biological Process), "MF" (Molecular Function), "CC" (Cellular Component), or NULL for all three. Not set here, so we get all.
keytype — gene ID type in the OrgDb. Default "ENTREZID". If your DE data uses gene symbols, the default works because get_pathway_genes_db() resolves Entrez IDs to symbols automatically (stored in db_object_symbol).
evidence — restrict to specific evidence codes, e.g. c("IDA", "IPI") for experimentally supported annotations only. Default NULL includes all.
If your species doesn't have an OrgDb, use raw GO annotation files:
gaf <- read_gaf("data/goa_human.gaf")
go <- read_obo("data/go.obo")
pw <- get_pathway_genes(gaf, min_size = 5, aspect = "P", go_names = go)This is the main function. It computes DSGE for every pathway, generates permutation nulls, and returns significance estimates.
Each gene's p-value is first converted to an absolute z-score:
where
To assess significance, DSGE uses size-grouped permutation: pathways with the same number of matched genes share one null distribution, so total computation is |unique sizes| × n_perm instead of |pathways| × n_perm. For ~12,000 genes with pathway sizes 5–500, there are ~200 unique sizes — so 10,000 permutations each is ~2 million draws, far cheaper than 13,000 × 10,000 = 130 million.
result <- pathway_dsge(
# --- Input data ---
pathway_genes = pw,
pvalue = res$pvalue,
base_mean = res$AveExpr,
gene_names = res$gene,
gene_id_col = "db_object_symbol",
base_mean_cutoff = 0.1,
min_size = 5,
max_size = 500,
# --- Permutation ---
n_perm = 100000,
seed = 42,
progress = TRUE,
# --- NDS direction ---
directional = TRUE,
direction_vec = res$log2FoldChange,
nds_top_frac = 0.25,
# --- GPD tail ---
use_gpd = TRUE,
gpd_threshold = 0.99,
# --- Correction ---
p_adjust_method = "BY",
# --- Plotting ---
return_null = TRUE
)
result_tbl <- result$tablebase_mean and base_mean_cutoff — filter out lowly-expressed genes.
base_mean = res$AveExpr supplies the expression levels; base_mean_cutoff = 0.1 removes any gene with AveExpr <= 0.1. This reduces noise from unexpressed or very lowly expressed genes. Set base_mean = NULL to skip filtering entirely.
gene_id_col — tells DSGE which column in the pathway data.frames to match against gene_names. Default "db_object_symbol" works when both your DE results and GAF use gene symbols. If your DE uses Entrez IDs, set to "db_object_id".
min_size / max_size — restrict the size range of tested pathways. Default 5–500. Pathways with very few genes give unreliable DSGE estimates; very large pathways tend to be biologically non-specific.
n_perm — permutations per size group. Higher values give finer p-value resolution. 10,000 is fine for exploration, 100,000 is recommended for publication. The permutation engine uses Rcpp (C++), so each permutation pass is fast — the main bottleneck is the total number of draws, not the per-draw cost.
A note on parallelisation: n_cores is supported via parallel::mclapply, but this only works on Linux/macOS. On Windows, mclapply falls back to single-threaded execution. The Rcpp implementation is already efficient enough that single-threaded runs are practical.
use_std (= TRUE by default) — adds a dsge_std column:
This standardises the observed DSGE against its size-matched null, producing a z-score-like metric that is comparable across pathways of different sizes.
directional + direction_vec — compute the Normalized Direction Score (NDS).
When directional = TRUE, you provide direction_vec (e.g. log2FoldChange) and DSGE computes whether a perturbed pathway is net up- or down-regulated:
where U = sum of z-scores of up-regulated genes (direction > 0) and D = sum of z-scores of down-regulated genes (direction < 0). The result ranges from -1 (all down) to +1 (all up).
nds_top_frac (= 0.25) — only the top 25% of genes with the largest absolute direction_vec are used for NDS. This excludes weakly perturbed genes that add noise to the direction signal. If the top subset has fewer than 10 genes, all genes are used as a fallback.
NDS is a descriptive metric — statistical significance is determined by the DSGE permutation p-value, not by NDS.
use_gpd (= TRUE) — enables Generalized Pareto Distribution tail extrapolation.
The empirical null distribution from n_perm draws can only resolve p-values down to 1/n_perm (e.g. 1e-5 for 100,000 permutations). When the observed DSGE is more extreme than any null draw, the empirical p-value would be 0, which is not useful.
GPD extrapolation solves this by fitting a Generalized Pareto Distribution to the upper tail of the null (above the gpd_threshold quantile, default 0.99) and using it to assign fine-grained p-values to extreme observations. A support-constrained adjustment (Peschel et al. 2025) prevents the GPD from returning p = 0 when it predicts a finite upper bound.
Set use_gpd = FALSE for pure empirical ECDF (p-values always >= 1/n_perm).
p_adjust_method (= "BY") — multiple testing correction.
Default is Benjamini-Yekutieli (BY), which controls the False Discovery Rate under arbitrary dependence. This is appropriate for GO analysis because pathways share genes and are not independent. For less conservative correction (assuming independence), use "BH".
Supported: "holm", "hochberg", "hommel", "bonferroni", "BH", "fdr", "BY", "none".
return_null (= TRUE) — stores the full null distributions in the result object. This is required for plot_dsge(). The stored data can be several hundred MB.
dim(result_tbl)
# [1] 13001 10
names(result_tbl)
# [1] "gs_id" "gs_name" "gs_source" "n_pathway" "n_matched"
# [6] "dsge" "dsge_std" "nds" "p_value" "p_adj"| Column | Meaning |
|---|---|
gs_id |
Gene set / pathway identifier |
gs_name |
Human-readable name |
gs_source |
Source database: BP / MF / CC / KEGG / REACTOME |
n_pathway |
Total genes in the annotation |
n_matched |
Genes matched in the expression data (this is the size used for permutation grouping) |
dsge |
Observed DSGE: mean absolute z-score |
dsge_std |
Standardised DSGE: (observed - null mean) / null SD |
nds |
Normalized Direction Score: -1 to +1 |
p_value |
Permutation p-value (GPD-extrapolated if above threshold) |
p_adj |
Adjusted p-value (BY correction) |
How many significant pathways?
sum(result_tbl$p_adj < 0.05)
# [1] ~3000Ontology breakdown of significant terms:
table(result_tbl$gs_source[result_tbl$p_adj < 0.05])
# BP CC MF
# ~1800 ~500 ~600Top 10 most significant pathways:
head(result_tbl[, c("gs_id", "gs_name", "n_matched", "dsge", "dsge_std", "p_adj")], 10)Extreme directionality (up/down):
# Most up-regulated (NDS close to +1)
head(result_tbl[order(result_tbl$nds, decreasing = TRUE),
c("gs_id", "gs_name", "nds", "p_adj")])
# Most down-regulated (NDS close to -1)
head(result_tbl[order(result_tbl$nds, decreasing = FALSE),
c("gs_id", "gs_name", "nds", "p_adj")])Save to CSV:
write.csv(result_tbl, "FLT3_IR_vs_FLT3_DSGE.csv", row.names = FALSE)Use dsge_perm_test() to test one custom gene list without running the full pipeline. The same parameters apply.
my_genes <- c("FLT3", "STAT5A", "STAT5B", "PIK3CA", "AKT1")
test <- dsge_perm_test(
gene_list = my_genes,
pvalue = res$pvalue,
base_mean = res$AveExpr,
gene_names = res$gene,
base_mean_cutoff = 0.1,
n_perm = 100000,
seed = 42,
directional = TRUE,
direction_vec = res$log2FoldChange,
nds_top_frac = 0.25,
use_gpd = TRUE
)
test$observed # DSGE of your gene set
test$p_value # permutation p-value
test$dsge_std # standardised DSGE
test$nds # direction score (-1 to +1)Shows the size-matched null distribution for each pathway, with observed DSGE marked as a red dashed line and the GPD tail region highlighted in orange.
# Top 9 by p_adj
plot_dsge(result, n = 9)
# Specific pathways
plot_dsge(result, gs_ids = c("GO:0030099", "GO:0002521"))
# Standardised scale (recommended — makes different pathway sizes comparable)
plot_dsge(result, n = 9, use_std = TRUE)
# Save to PDF
pdf("top9_FLT3_DSGE.pdf", width = 12, height = 10)
plot_dsge(result, n = 9, use_std = TRUE)
dev.off()Plot significance vs DSGE or NDS for all tested pathways:
plot_dsge_volcano(result_tbl, x = "dsge_std", p_col = "p_adj",
label_n = 10, label_by = "gs_name")Using the same DE results, analyse KEGG pathways via the get_pathway_genes_kegg() function. Pathway names are fetched online via KEGGREST.
# Build KEGG pathway map
pw_kegg <- get_pathway_genes_kegg(org.Hs.eg.db, min_size = 5L)
length(pw_kegg)
# [1] 223 (KEGG pathways with >= 5 genes)
str(pw_kegg[[1]])
# Classes 'kegg_pathway' and 'data.frame': 65 obs. of 4 variables:
# $ kegg_name : chr "Glycolysis / Gluconeogenesis - Homo sapiens (human)"
# $ organism_code : chr "hsa"
# $ kegg_gene_id : chr "124" "125" "126" ...
# $ kegg_gene_symbol: chr "ADH1A" "ADH1B" "ADH1C" ...
# Run DSGE — gene_id_col auto-detected from S3 class
result_kegg <- pathway_dsge(
pathway_genes = pw_kegg,
pvalue = res$pvalue,
base_mean = res$AveExpr,
gene_names = res$gene,
min_size = 10L,
max_size = 500L,
n_perm = 100000,
seed = 42,
directional = TRUE,
direction_vec = res$log2FoldChange,
return_null = TRUE
)
tbl_kegg <- result_kegg$table
cat("KEGG pathways tested:", nrow(tbl_kegg),
"| p_adj<0.05:", sum(tbl_kegg$p_adj < 0.05, na.rm = TRUE), "\n")
# Columns are source-specific: kegg_id, kegg_name
head(tbl_kegg[, c("kegg_id", "kegg_name", "n_matched", "dsge", "dsge_std", "p_adj")])
# Volcano plot (specify the correct ID column)
plot_dsge_volcano(tbl_kegg, x = "dsge_std", p_col = "p_adj",
id_col = "kegg_id", label_by = "kegg_name", label_n = 10)Analyse Reactome pathways via get_pathway_genes_reactome(). Pathway names come from the local reactome.db package — no network needed.
# Build Reactome pathway map
pw_react <- get_pathway_genes_reactome(org.Hs.eg.db, min_size = 5L)
length(pw_react)
# [1] ~1200 (Reactome pathways with >= 5 genes)
str(pw_react[[1]])
# Classes 'reactome_pathway' and 'data.frame': 96 obs. of 3 variables:
# $ reactome_name : chr "Signaling by Interleukins"
# $ reactome_gene_id : chr "1" "2" "3" ...
# $ reactome_gene_symbol: chr "A1BG" "A2M" "A2MP1" ...
# Run DSGE
result_react <- pathway_dsge(
pathway_genes = pw_react,
pvalue = res$pvalue,
base_mean = res$AveExpr,
gene_names = res$gene,
min_size = 10L,
max_size = 500L,
n_perm = 100000,
seed = 42,
directional = TRUE,
direction_vec = res$log2FoldChange,
return_null = TRUE
)
tbl_react <- result_react$table
head(tbl_react[, c("reactome_id", "reactome_name", "n_matched", "dsge", "dsge_std", "p_adj")])
# Mixing sources in the same volcano plot
plot_dsge_volcano(tbl_react, x = "dsge_std", p_col = "p_adj",
id_col = "reactome_id", label_by = "reactome_name", label_n = 10)Because pathway_dsge() uses the same core algorithm regardless of source, you can freely compare results:
# Bind results from different sources (different columns)
list(
GO = result$table[, c("go_id", "go_name", "dsge_std", "p_adj")],
KEGG = result_kegg$table[, c("kegg_id", "kegg_name", "dsge_std", "p_adj")],
Reactome = result_react$table[, c("reactome_id", "reactome_name", "dsge_std", "p_adj")]
)
# All results carry the source as an attribute
attr(result$table, "pathway_source") # "GO"
attr(result_kegg$table, "pathway_source") # "KEGG"
attr(result_react$table, "pathway_source") # "REACTOME"