Skip to content

01 Worked Example

songif edited this page Jun 15, 2026 · 5 revisions

Worked Example: FLT3-IR vs FLT3 in K562 CML Cells

This walkthrough uses a limma comparison of FLT3-IR (drug-resistant) vs FLT3-overexpressing K562 cells (GSE226360, ~12,000 genes) to demonstrate a complete DSGE analysis.

Setup

library(DSGE)
library(org.Hs.eg.db)

1. Import DE Results

DSGE accepts p-values from any differential expression tool — DESeq2, limma, edgeR, Seurat, etc. The only required column is pvalue.

The package ships with example data in inst/data_exp/. Let's load it:

res <- read.csv("inst/data_exp/limma_FLT3_IR_vs_FLT3.csv",
                 stringsAsFactors = FALSE)

dim(res)
# [1] 12191     6

colnames(res)
# [1] "gene"           "log2FoldChange" "AveExpr"        "t"
# [5] "pvalue"         "padj"

We have 6 columns, about 12,000 genes. The columns we need:

  • pvalue — nominal p-values (required by DSGE)
  • AveExpr — average expression level (optional, used for low-expression filtering)
  • log2FoldChange — effect size (optional, used for directional NDS analysis)

2. Build Pathway-Gene Map

DSGE needs to know which genes belong to which GO pathways. The function get_pathway_genes_db() builds this mapping from a Bioconductor OrgDb, which already contains curated GO annotations.

pw <- get_pathway_genes_db(
  org.Hs.eg.db,
  min_size  = 5        # drop GO terms with fewer than 5 genes
)

length(pw)
# [1] 13001  (number of GO terms with >= 5 genes)

The output pw is a named list — each name is a GO term ID, and each element contains the genes in that pathway:

str(pw[[1]])
# 'data.frame':  21 obs. of  4 variables:
#  $ db_object_id     : chr  "266" "84720" ...
#  $ db_object_symbol : chr  "AMBRA1" "MYO19" ...
#  $ gs_name          : chr  "mitochondrion inheritance"
#  $ gs_source        : chr  "BP"

Key parameters of get_pathway_genes_db()

min_size — pathways smaller than this are excluded. Below 5, the DSGE null distribution becomes very noisy; above 500, pathways are too broad to be informative.

aspect — filter by ontology. "BP" (Biological Process), "MF" (Molecular Function), "CC" (Cellular Component), or NULL for all three. Not set here, so we get all.

keytype — gene ID type in the OrgDb. Default "ENTREZID". If your DE data uses gene symbols, the default works because get_pathway_genes_db() resolves Entrez IDs to symbols automatically (stored in db_object_symbol).

evidence — restrict to specific evidence codes, e.g. c("IDA", "IPI") for experimentally supported annotations only. Default NULL includes all.

Alternative: GAF + OBO mode

If your species doesn't have an OrgDb, use raw GO annotation files:

gaf <- read_gaf("data/goa_human.gaf")
go  <- read_obo("data/go.obo")
pw  <- get_pathway_genes(gaf, min_size = 5, aspect = "P", go_names = go)

3. Core Analysis: pathway_dsge()

This is the main function. It computes DSGE for every pathway, generates permutation nulls, and returns significance estimates.

How DSGE works

Each gene's p-value is first converted to an absolute z-score:

$$z_i = |\Phi^{-1}(1 - p_i / 2)|$$

where $\Phi^{-1}$ is the inverse normal CDF. A pathway's DSGE is simply the mean of these z-scores across its member genes. Higher values = stronger transcriptional perturbation.

To assess significance, DSGE uses size-grouped permutation: pathways with the same number of matched genes share one null distribution, so total computation is |unique sizes| × n_perm instead of |pathways| × n_perm. For ~12,000 genes with pathway sizes 5–500, there are ~200 unique sizes — so 10,000 permutations each is ~2 million draws, far cheaper than 13,000 × 10,000 = 130 million.

Running the analysis

result <- pathway_dsge(
  # --- Input data ---
  pathway_genes    = pw,
  pvalue           = res$pvalue,
  base_mean        = res$AveExpr,
  gene_names       = res$gene,
  gene_id_col      = "db_object_symbol",
  base_mean_cutoff = 0.1,
  min_size         = 5,
  max_size         = 500,

  # --- Permutation ---
  n_perm           = 100000,
  seed             = 42,
  progress         = TRUE,

  # --- NDS direction ---
  directional      = TRUE,
  direction_vec    = res$log2FoldChange,
  nds_top_frac     = 0.25,

  # --- GPD tail ---
  use_gpd          = TRUE,
  gpd_threshold    = 0.99,

  # --- Correction ---
  p_adjust_method  = "BY",

  # --- Plotting ---
  return_null      = TRUE
)

result_tbl <- result$table

Parameter walkthrough


base_mean and base_mean_cutoff — filter out lowly-expressed genes.

base_mean = res$AveExpr supplies the expression levels; base_mean_cutoff = 0.1 removes any gene with AveExpr <= 0.1. This reduces noise from unexpressed or very lowly expressed genes. Set base_mean = NULL to skip filtering entirely.


gene_id_col — tells DSGE which column in the pathway data.frames to match against gene_names. Default "db_object_symbol" works when both your DE results and GAF use gene symbols. If your DE uses Entrez IDs, set to "db_object_id".


min_size / max_size — restrict the size range of tested pathways. Default 5–500. Pathways with very few genes give unreliable DSGE estimates; very large pathways tend to be biologically non-specific.


n_perm — permutations per size group. Higher values give finer p-value resolution. 10,000 is fine for exploration, 100,000 is recommended for publication. The permutation engine uses Rcpp (C++), so each permutation pass is fast — the main bottleneck is the total number of draws, not the per-draw cost.

A note on parallelisation: n_cores is supported via parallel::mclapply, but this only works on Linux/macOS. On Windows, mclapply falls back to single-threaded execution. The Rcpp implementation is already efficient enough that single-threaded runs are practical.


use_std (= TRUE by default) — adds a dsge_std column:

$$dsge_std = \frac{\text{observed} - \text{mean(null)}}{\text{sd(null)}}$$

This standardises the observed DSGE against its size-matched null, producing a z-score-like metric that is comparable across pathways of different sizes.


directional + direction_vec — compute the Normalized Direction Score (NDS).

When directional = TRUE, you provide direction_vec (e.g. log2FoldChange) and DSGE computes whether a perturbed pathway is net up- or down-regulated:

$$NDS = \frac{U - D}{\max(U, D)}$$

where U = sum of z-scores of up-regulated genes (direction > 0) and D = sum of z-scores of down-regulated genes (direction < 0). The result ranges from -1 (all down) to +1 (all up).

nds_top_frac (= 0.25) — only the top 25% of genes with the largest absolute direction_vec are used for NDS. This excludes weakly perturbed genes that add noise to the direction signal. If the top subset has fewer than 10 genes, all genes are used as a fallback.

NDS is a descriptive metric — statistical significance is determined by the DSGE permutation p-value, not by NDS.


use_gpd (= TRUE) — enables Generalized Pareto Distribution tail extrapolation.

The empirical null distribution from n_perm draws can only resolve p-values down to 1/n_perm (e.g. 1e-5 for 100,000 permutations). When the observed DSGE is more extreme than any null draw, the empirical p-value would be 0, which is not useful.

GPD extrapolation solves this by fitting a Generalized Pareto Distribution to the upper tail of the null (above the gpd_threshold quantile, default 0.99) and using it to assign fine-grained p-values to extreme observations. A support-constrained adjustment (Peschel et al. 2025) prevents the GPD from returning p = 0 when it predicts a finite upper bound.

Set use_gpd = FALSE for pure empirical ECDF (p-values always >= 1/n_perm).


p_adjust_method (= "BY") — multiple testing correction.

Default is Benjamini-Yekutieli (BY), which controls the False Discovery Rate under arbitrary dependence. This is appropriate for GO analysis because pathways share genes and are not independent. For less conservative correction (assuming independence), use "BH".

Supported: "holm", "hochberg", "hommel", "bonferroni", "BH", "fdr", "BY", "none".


return_null (= TRUE) — stores the full null distributions in the result object. This is required for plot_dsge(). The stored data can be several hundred MB.

4. Inspect Results

dim(result_tbl)
# [1] 13001    10

names(result_tbl)
# [1] "gs_id"     "gs_name"   "gs_source" "n_pathway" "n_matched"
# [6] "dsge"      "dsge_std"  "nds"       "p_value"   "p_adj"

Result columns

Column Meaning
gs_id Gene set / pathway identifier
gs_name Human-readable name
gs_source Source database: BP / MF / CC / KEGG / REACTOME
n_pathway Total genes in the annotation
n_matched Genes matched in the expression data (this is the size used for permutation grouping)
dsge Observed DSGE: mean absolute z-score
dsge_std Standardised DSGE: (observed - null mean) / null SD
nds Normalized Direction Score: -1 to +1
p_value Permutation p-value (GPD-extrapolated if above threshold)
p_adj Adjusted p-value (BY correction)

Key questions

How many significant pathways?

sum(result_tbl$p_adj < 0.05)
# [1] ~3000

Ontology breakdown of significant terms:

table(result_tbl$gs_source[result_tbl$p_adj < 0.05])

#   BP   CC   MF
# ~1800 ~500 ~600

Top 10 most significant pathways:

head(result_tbl[, c("gs_id", "gs_name", "n_matched", "dsge", "dsge_std", "p_adj")], 10)

Extreme directionality (up/down):

# Most up-regulated (NDS close to +1)
head(result_tbl[order(result_tbl$nds, decreasing = TRUE),
                c("gs_id", "gs_name", "nds", "p_adj")])

# Most down-regulated (NDS close to -1)
head(result_tbl[order(result_tbl$nds, decreasing = FALSE),
                c("gs_id", "gs_name", "nds", "p_adj")])

Save to CSV:

write.csv(result_tbl, "FLT3_IR_vs_FLT3_DSGE.csv", row.names = FALSE)

5. Test a Single Gene Set

Use dsge_perm_test() to test one custom gene list without running the full pipeline. The same parameters apply.

my_genes <- c("FLT3", "STAT5A", "STAT5B", "PIK3CA", "AKT1")

test <- dsge_perm_test(
  gene_list        = my_genes,
  pvalue           = res$pvalue,
  base_mean        = res$AveExpr,
  gene_names       = res$gene,
  base_mean_cutoff = 0.1,
  n_perm           = 100000,
  seed             = 42,
  directional      = TRUE,
  direction_vec    = res$log2FoldChange,
  nds_top_frac     = 0.25,
  use_gpd          = TRUE
)

test$observed      # DSGE of your gene set
test$p_value       # permutation p-value
test$dsge_std      # standardised DSGE
test$nds           # direction score (-1 to +1)

6. Visualise

Null distribution plots

Shows the size-matched null distribution for each pathway, with observed DSGE marked as a red dashed line and the GPD tail region highlighted in orange.

# Top 9 by p_adj
plot_dsge(result, n = 9)

# Specific pathways
plot_dsge(result, gs_ids = c("GO:0030099", "GO:0002521"))

# Standardised scale (recommended — makes different pathway sizes comparable)
plot_dsge(result, n = 9, use_std = TRUE)

# Save to PDF
pdf("top9_FLT3_DSGE.pdf", width = 12, height = 10)
plot_dsge(result, n = 9, use_std = TRUE)
dev.off()

Volcano plot

Plot significance vs DSGE or NDS for all tested pathways:

plot_dsge_volcano(result_tbl, x = "dsge_std", p_col = "p_adj",
                  label_n = 10, label_by = "gs_name")

7. KEGG Pathway Analysis

Using the same DE results, analyse KEGG pathways via the get_pathway_genes_kegg() function. Pathway names are fetched online via KEGGREST.

# Build KEGG pathway map
pw_kegg <- get_pathway_genes_kegg(org.Hs.eg.db, min_size = 5L)

length(pw_kegg)
# [1] 223  (KEGG pathways with >= 5 genes)

str(pw_kegg[[1]])
# Classes 'kegg_pathway' and 'data.frame':  65 obs. of  4 variables:
#  $ kegg_name       : chr  "Glycolysis / Gluconeogenesis - Homo sapiens (human)"
#  $ organism_code   : chr  "hsa"
#  $ kegg_gene_id    : chr  "124" "125" "126" ...
#  $ kegg_gene_symbol: chr  "ADH1A" "ADH1B" "ADH1C" ...

# Run DSGE — gene_id_col auto-detected from S3 class
result_kegg <- pathway_dsge(
  pathway_genes = pw_kegg,
  pvalue        = res$pvalue,
  base_mean     = res$AveExpr,
  gene_names    = res$gene,
  min_size      = 10L,
  max_size      = 500L,
  n_perm        = 100000,
  seed          = 42,
  directional   = TRUE,
  direction_vec = res$log2FoldChange,
  return_null   = TRUE
)

tbl_kegg <- result_kegg$table
cat("KEGG pathways tested:", nrow(tbl_kegg),
    "| p_adj<0.05:", sum(tbl_kegg$p_adj < 0.05, na.rm = TRUE), "\n")

# Columns are source-specific: kegg_id, kegg_name
head(tbl_kegg[, c("kegg_id", "kegg_name", "n_matched", "dsge", "dsge_std", "p_adj")])

# Volcano plot (specify the correct ID column)
plot_dsge_volcano(tbl_kegg, x = "dsge_std", p_col = "p_adj",
                  id_col = "kegg_id", label_by = "kegg_name", label_n = 10)

8. Reactome Pathway Analysis

Analyse Reactome pathways via get_pathway_genes_reactome(). Pathway names come from the local reactome.db package — no network needed.

# Build Reactome pathway map
pw_react <- get_pathway_genes_reactome(org.Hs.eg.db, min_size = 5L)

length(pw_react)
# [1] ~1200  (Reactome pathways with >= 5 genes)

str(pw_react[[1]])
# Classes 'reactome_pathway' and 'data.frame':  96 obs. of  3 variables:
#  $ reactome_name       : chr  "Signaling by Interleukins"
#  $ reactome_gene_id    : chr  "1" "2" "3" ...
#  $ reactome_gene_symbol: chr  "A1BG" "A2M" "A2MP1" ...

# Run DSGE
result_react <- pathway_dsge(
  pathway_genes = pw_react,
  pvalue        = res$pvalue,
  base_mean     = res$AveExpr,
  gene_names    = res$gene,
  min_size      = 10L,
  max_size      = 500L,
  n_perm        = 100000,
  seed          = 42,
  directional   = TRUE,
  direction_vec = res$log2FoldChange,
  return_null   = TRUE
)

tbl_react <- result_react$table
head(tbl_react[, c("reactome_id", "reactome_name", "n_matched", "dsge", "dsge_std", "p_adj")])

# Mixing sources in the same volcano plot
plot_dsge_volcano(tbl_react, x = "dsge_std", p_col = "p_adj",
                  id_col = "reactome_id", label_by = "reactome_name", label_n = 10)

Comparing across sources

Because pathway_dsge() uses the same core algorithm regardless of source, you can freely compare results:

# Bind results from different sources (different columns)
list(
  GO      = result$table[, c("go_id", "go_name", "dsge_std", "p_adj")],
  KEGG    = result_kegg$table[, c("kegg_id", "kegg_name", "dsge_std", "p_adj")],
  Reactome = result_react$table[, c("reactome_id", "reactome_name", "dsge_std", "p_adj")]
)

# All results carry the source as an attribute
attr(result$table, "pathway_source")       # "GO"
attr(result_kegg$table, "pathway_source")  # "KEGG"
attr(result_react$table, "pathway_source") # "REACTOME"

Clone this wiki locally