This vignette is part of the workshop: [Hands-on Tour of the Visium Spatial Gene Expression Analysis Journey](https://www.10xgenomics.com/analysis-guides/workshop-visium-hd-analysis).

We will begin by installing the necessary packages.

In [None]:
install.packages("remotes")
install.packages("devtools")
system("apt-get install -y libgsl-dev", intern = TRUE)
devtools::install_github("paulponcet/lplyr")

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = '3.19',ask = FALSE)

BiocManager::install("clusterProfiler")
BiocManager::install("enrichplot")
BiocManager::install("ggplot2")
BiocManager::install("msigdbr")
BiocManager::install("dplyr")
BiocManager::install("DOSE")
BiocManager::install("forcats")
BiocManager::install("AnnotationDbi")
BiocManager::install("org.Hs.eg.db")

Next, load these required packages into the environment.

In [None]:
library(clusterProfiler)
library(enrichplot)
library(ggplot2)
library(msigdbr)
library(dplyr)
library(lplyr)
library(DOSE)
library(forcats)
library(AnnotationDbi)
library(org.Hs.eg.db)

Read in the CSV file containing gene expression information and take a look at the data frame.

In [None]:
download.file("https://raw.githubusercontent.com/10XGenomics/analysis_guides/main/Visium_HD_GSEA/ROI_Features.csv", "ROI_Features.csv")
df <- read.csv("ROI_Features.csv", header = TRUE)
head(df)

How many genes are in our list?

In [None]:
nrow(df)

GSEA requires Entrez IDs as input. These are unique integer identifiers for genes from NCBI. We will use AnnotationDbi to find the IDs for each gene symbol in our list and take a look at the top of the data frame to see what it looks like.

In [None]:
entrez_data <- AnnotationDbi::select(
  org.Hs.eg.db,
  keys = df$SYMBOL,
  columns = c("SYMBOL", "ENTREZID"),
  keytype = "SYMBOL"
)
head(entrez_data)

Not every gene symbol has an associated Entrez ID. We will need to remove genes that do not have an Entrez ID ("NA") from our input list.

In [None]:
anno_result <- entrez_data %>%
  filter(!is.na(ENTREZID)) %>%
  inner_join(df, by = "SYMBOL", relationship = "many-to-many")

head(anno_result)

How many genes remain after filtering NAs?

In [None]:
nrow(anno_result)

The starting list contained 18,072 genes, so we removed 437 of them (~2%) because they had no Entrez ID. This is a small proportion of our gene list, so we will proceed on the assumption that the list is still representative of our sample.
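
As a quick check of the arithmetic above, here is a minimal sketch that uses only the two counts quoted in the text (18,072 starting genes and 437 removed). The variable names are illustrative and not part of the workshop code.

```r
# Counts quoted in the text above
n_start   <- 18072  # genes in the original list
n_removed <- 437    # genes dropped because they lacked an Entrez ID

# Genes remaining and percentage removed, given those two counts
n_kept      <- n_start - n_removed
pct_removed <- round(100 * n_removed / n_start, 1)

cat("Genes kept:", n_kept, "\n")
cat("Percent removed:", pct_removed, "\n")
```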

Next, we will turn this data frame into a named vector that holds the Log2FC values, named by Entrez ID, and sort it from largest to smallest Log2FC.

In [None]:
geneList <- with(anno_result, setNames(Log2FC, ENTREZID))
geneList <- sort(geneList, decreasing = TRUE)
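
A symbol can map to more than one Entrez ID, and the join above allows many-to-many matches, so it can be worth checking the named vector for repeated names before running GSEA. The sketch below uses a small made-up data frame (`toy`, with hypothetical IDs and values) rather than the workshop data. Keeping the entry with the largest absolute Log2FC is only one possible way to resolve duplicates, not the workshop's method.

```r
# Made-up example data: Entrez ID "102" appears twice
toy <- data.frame(
  ENTREZID = c("101", "102", "102", "103"),
  Log2FC   = c(1.5, -0.4, -2.0, 0.3),
  stringsAsFactors = FALSE
)

# Build and sort the named vector the same way as in the cell above
toy_list <- with(toy, setNames(Log2FC, ENTREZID))
toy_list <- sort(toy_list, decreasing = TRUE)

# Are any names repeated?
has_dups <- any(duplicated(names(toy_list)))
print(has_dups)

# One option (an assumption, not the workshop's choice):
# keep the entry with the largest |Log2FC| for each ID, then re-sort
by_strength <- toy_list[order(abs(toy_list), decreasing = TRUE)]
dedup <- by_strength[!duplicated(names(by_strength))]
dedup <- sort(dedup, decreasing = TRUE)
print(dedup)
```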

Next, we need to prepare the reference gene sets that we will use for analysis. There are a number of databases available, including popular ones such as Gene Ontology: Molecular Function. We can access these gene sets from MSigDB with the msigdbr package. To learn more about the gene sets available through this resource, go to this web page: https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp

Among these collections is a set of computational gene sets defined by expression neighborhoods centered on cancer-associated genes (collection C4, subcollection CGN). This collection is relevant to our data, so we will use it.

In [None]:
c4_t2g <- msigdbr(species = "Homo sapiens", category = "C4", subcategory = "CGN") %>%
  dplyr::select(gs_name, entrez_gene)
head(c4_t2g)

Now that we have our ranked gene list and gene set reference inputs, we can run GSEA.

In [None]:
c4 <- GSEA(geneList, TERM2GENE = c4_t2g)
c4
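
GSEA asks whether the members of a gene set cluster toward the top or the bottom of the ranked list. The toy function below (`toy_es`, a hypothetical helper) illustrates the idea with a simplified, unweighted running sum. It is not the weighted, permutation-tested statistic that `GSEA()` computes, and the gene IDs are made up.

```r
# Simplified, unweighted running-sum score (illustration only)
toy_es <- function(ranked_ids, gene_set) {
  in_set <- ranked_ids %in% gene_set
  n_hit  <- sum(in_set)
  n_miss <- length(ranked_ids) - n_hit
  # Step up at each set member, step down at every other gene
  step    <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running <- cumsum(step)
  # Report the largest deviation from zero, keeping its sign
  running[which.max(abs(running))]
}

# Ten made-up genes, already ranked from most up- to most down-regulated
ranked     <- paste0("g", 1:10)
top_set    <- c("g1", "g2", "g3")    # members sit at the top of the list
bottom_set <- c("g8", "g9", "g10")   # members sit at the bottom of the list

es_top    <- toy_es(ranked, top_set)     # positive score
es_bottom <- toy_es(ranked, bottom_set)  # negative score
print(c(top = es_top, bottom = es_bottom))
```

A positive score corresponds to a set concentrated among the up-regulated genes and a negative score to a set concentrated among the down-regulated genes, which is the sign convention the NES follows in the results.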

The gene information is stored as Entrez IDs, so we need to add the gene symbols to the c4 object to make the results easier to interpret. We will also put our results into a data frame so they are easier to navigate.

In [None]:
c4 <- setReadable(c4, 'org.Hs.eg.db', 'ENTREZID')
c4_df <- c4@result

Our analysis is complete and stored in an object we called “c4”. There were ~170-180 enriched terms found in our data. Results may vary slightly due to random seeding.
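
The run-to-run variation comes from random permutations. The toy example below shows, in base R, how fixing the seed makes a permutation-based p-value repeatable. `perm_p` is a hypothetical helper written for this illustration and the scores are made up. To our knowledge `GSEA()` also accepts a `seed` argument that can be combined with `set.seed()`, but treat that as an assumption and check the clusterProfiler documentation for your installed version.

```r
# Toy permutation test: is the mean score of the set unusually extreme?
perm_p <- function(scores, in_set, n_perm = 1000) {
  observed <- mean(scores[in_set])
  # Null distribution: mean score of random sets of the same size
  null <- replicate(n_perm, mean(sample(scores, sum(in_set))))
  mean(abs(null) >= abs(observed))
}

# Made-up scores; the first three genes form the set
scores <- c(2.1, 1.8, 1.2, 0.4, 0.1, -0.3, -0.9, -1.5)
in_set <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Same seed, same permutations, same p-value
set.seed(42)
p1 <- perm_p(scores, in_set)
set.seed(42)
p2 <- perm_p(scores, in_set)

print(identical(p1, p2))
```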

We can use the object storing the results to generate some plots. First, we will look at an overall summary. We can visualize the top 10 up-regulated gene sets (activated in the microenvironment), which were enriched at the top of our ranked list, where the up-regulated genes are located. We will also visualize the top 10 down-regulated gene sets (suppressed in the microenvironment), which were enriched at the bottom of our list, where the down-regulated genes are located.

In [None]:
#set plot size
options(repr.plot.width=10, repr.plot.height=6)

#sort data and assign to "activated" or "suppressed" groups based on NES
sorted_c4 <- c4@result[order(c4@result$NES, decreasing = FALSE), ]
sorted_c4$color <- ifelse(sorted_c4$NES < 0, "Enriched in tumor", "Enriched in TME")

#plot results
sorted_c4 %>%
  dplyr::group_by(color) %>%
  dplyr::arrange(desc(abs(NES))) %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = NES, y = reorder(Description, NES), fill = color)) +
  geom_bar(stat = "identity") +
  geom_vline(xintercept = 0) +
  labs(y = "Description") +
  theme_classic() +
  scale_fill_manual(values=c("#ffa557", "#4296f5")) +
  theme(legend.position = "right")

#save a publication-quality version of the plot
ggsave("go_barplot.pdf",
        dpi = 600,
        width = 30, height = 15, units = "cm")
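
The selection step in the plotting code above (label each gene set by the sign of its NES, then keep the strongest few per label) can be sketched on made-up data in base R. The data frame `toy` and the helper `top_n_per_group` are hypothetical and exist only for this illustration.

```r
# Made-up results table with three positive and three negative NES values
toy <- data.frame(
  Description = paste0("SET_", 1:6),
  NES = c(2.4, 1.9, 1.1, -1.3, -2.2, -1.7),
  stringsAsFactors = FALSE
)

# Same labelling rule as the plotting cell
toy$color <- ifelse(toy$NES < 0, "Enriched in tumor", "Enriched in TME")

# Keep the n rows with the largest |NES| within each label
top_n_per_group <- function(df, n) {
  pieces <- lapply(split(df, df$color), function(g) {
    head(g[order(abs(g$NES), decreasing = TRUE), ], n)
  })
  do.call(rbind, pieces)
}

top2 <- top_n_per_group(toy, 2)
print(top2[, c("Description", "NES", "color")])
```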

Next, we can select one of these gene sets to explore more in-depth.


The database we are using organizes the gene pathways into sets of genes centered around known cancer-related genes. We will need to make use of the GSEA website to understand what each gene set is.

For example, looking into one of the modules enriched in the microenvironment region, [GNF2_PECAM1](https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/GNF2_PECAM1.html), we can learn more about it. First, we can take a look at the full list of genes in this gene set.

Based on the literature, we may find promising cancer treatment targets by looking at the genes in this gene set. We can generate a GSEA plot and a cnet plot for this module to evaluate the full set of genes in our data:

In [None]:
#re-set plot size
options(repr.plot.width=8, repr.plot.height=6)

#gsea plot
gseaplot(c4, by = "all", title = "PECAM1 Network", geneSetID = "GNF2_PECAM1")

We can also pull out the NES and adjusted p-value associated with this enriched set.

In [None]:
c4_df["GNF2_PECAM1",]

Of the 56 genes in this PECAM1 gene set, 33 were detected in our experiment. We can visualize the expression of these genes relative to the tumor region in the cnet plot below:


In [None]:
cnetplot(c4, categorySize="pvalue", color.params = list(foldChange=geneList), showCategory = c("GNF2_PECAM1"))

Platelet endothelial cell adhesion molecule (PECAM-1) is a cell-cell adhesion protein found on endothelial cells, platelets, macrophages and Kupffer cells, granulocytes, lymphocytes (T cells, B cells, and NK cells), megakaryocytes, and osteoclasts. In the literature, this gene has been studied in the context of cancer. Specifically, [one study](https://doi.org/10.1073/pnas.1004654107) found that PECAM-1 in the tumor microenvironment drives advanced metastatic progression of tumor cells. [Another](https://doi.org/10.1023/A:1009092107382) found that an antibody against murine PECAM-1 inhibits tumor angiogenesis in mice.

We may find promising cancer treatment targets by looking at the genes in this gene set. The PECAM1 neighborhood identified with GSEA may represent a network of genes involved in regulating the tumor microenvironment in a way that makes the area more susceptible to tumor invasion. These genes may be worth investigating in future studies and targeting in future cancer treatment development. The last thing we will do is go back to Loupe Browser to visualize the expression of these genes in our tissue.

In [None]:
sessionInfo()