# Processing Pathway Information


![](./images/Module3/Workflow3.png)

## Overview

Differential expression (DE) analysis typically yields a list of genes or proteins. We want to use such lists to gain insight into genes and proteins that may play a role in a given
phenomenon, phenotype, or disease progression. In many cases, however, gene lists generated from DE analysis are
difficult to interpret because of their large size and lack of useful annotations. Pathway analysis (also known
as gene set analysis or over-representation analysis) aims to reduce the complexity of interpreting gene lists
by mapping the listed genes to known (i.e., annotated) biological pathways, processes, and functions.
This learning submodule introduces common curated biological databases, including Gene Ontology (GO), the
Kyoto Encyclopedia of Genes and Genomes (KEGG), and the database of reactions, pathways and biological processes (REACTOME). It also shows how to retrieve pathway information from these databases, process it, and save it in an appropriate format for the pathway analysis in the next submodule.


## Learning Objectives
1. Introduce ontologies and the Gene Ontology, the KEGG pathway database, and the REACTOME pathway database.
2. Download and save GO terms and pathway gene sets from GO, KEGG, and REACTOME.

## Prerequisites

**1. R Packages:**

* `BiocManager`:  Required for installing packages from Bioconductor.
* `topGO`: Used for Gene Ontology (GO) enrichment analysis.
* `hgu133plus2.db`:  Annotation package for the hgu133plus2 microarray platform (used to link gene IDs to GO terms).
* `AnnotationDbi`: Provides tools for working with annotation databases.
* `GO.db`:  Annotation package for Gene Ontology.
* `KEGGREST`: Client interface for the KEGG REST API.
* `ReactomeContentService4R`:  Client interface for the Reactome Content Service API.
* `magick`: Image processing package (dependency of ReactomeContentService4R). This also requires `libmagick++-dev`.



**2. System dependency:**
* `libmagick++-dev`: It is required for installation of `magick` package.

## Get Started

### Table of Contents

1. [Ontology and Gene Ontology](#go-main)
   -  1.1. [Overview](#go-overview)
   -  1.2. [Retrieving GO terms](#go-retrieve)
2. [Kyoto Encyclopedia of Genes and Genomes (KEGG)](#kegg-main)
   -  2.1. [Overview](#kegg-overview)
   -  2.2. [Retrieving pathways from KEGG database](#kegg-retrieve)
3. [REACTOME Pathway Database](#reactome-main)
   -  3.1. [Overview](#reactome-overview)
   -  3.2. [Retrieving pathways from REACTOME database](#reactome-retrieve)


**Note**: The pathway information in the curated biological databases introduced in this learning submodule is updated regularly. Users who have saved the downloaded pathway information in cloud storage do not need to download it again (rerun the notebook) every time they perform pathway analysis, but it is advisable to refresh the saved pathway data periodically so that it stays up to date.

In [None]:
IRdisplay::display_html('<iframe src="../Quizzes/Quiz_Submodule3-1.html" width=100% height=250></iframe>')

<!-- headings -->
<a id="go-main"></a>
## 1. Ontology and Gene Ontology
<!-- headings -->
<a id="go-overview"></a>
### 1.1. Overview
In this section, we will learn about the concept of gene ontology in bioinformatics. An ontology is a set of concepts and categories, defined by a shared vocabulary, that denotes the properties of the concepts as well as the relationships between them.
Ontologies play an important role in bioinformatics. An ontology enables unambiguous communication, e.g.,
a way to understand different groups’ annotations of various genomes. It also allows knowledge to be structured so that computer programs can perform automated analyses.

The Gene Ontology (GO) database defines a structured, common, and controlled vocabulary to describe attributes of genes and gene products
across organisms. Collaboration is key to building a consensus vocabulary. However, the term gene ontology, or GO, is commonly used
to refer both to the terms and to the associations between genes and terms, which can be a source of confusion. To avoid this, here we will use the term “GO”
to describe the set of terms and their hierarchical structure and “GO annotations” to describe the set of associations between
genes and GO terms. The GO is divided into three categories that describe genes and gene products from three different
angles: Molecular Function, Biological Process, and Cellular Component.

The structure of GO can be described in terms of directed acyclic graphs (DAGs), where each GO term is a node,
and the relationships between the terms are edges. GO is loosely hierarchical, with ‘child’ terms
being more specialized than their ‘parent’ terms, but unlike a strict hierarchy, a term may have more than one parent
term (note that the parent/child model does not hold for all types of relations). The structure of the controlled
vocabularies is intended to reflect true biological relationships. In contrast to strict hierarchies, DAGs allow
multiple relationships between a more granular (child) term and a more general (parent) term. The relationship between
terms affects how queries are made. For example, a query for all genes with binding activity would include transcription
factors as well as genes with other types of binding activity (such as protein binding, ligand binding). The illustration
of the category and structure of GO is shown in the figure below:

![](./images/Module3/GO_Structure.jpg)
*(Source: https://www.ebi.ac.uk/, http://geneontology.org/)*
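
To make the DAG idea concrete, here is a minimal sketch in base R. It assumes a toy ontology whose term names (`root`, `binding`, `tf_binding`, and so on) are made up for illustration and are not real GO identifiers. It shows a term with two parents and how a query for a general term picks up all of its more specific descendants.

```r
# Toy DAG: each term maps to its direct parents.
# Term names are illustrative only, not real GO IDs.
parents <- list(
  root            = character(0),
  binding         = "root",
  regulation      = "root",
  protein_binding = "binding",
  tf_binding      = c("protein_binding", "regulation")  # two parents: not a strict tree
)

# Collect all ancestors of a term by walking the parent edges upward
get_ancestors <- function(term, parents) {
  found <- character(0)
  queue <- parents[[term]]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% found)) {
      found <- c(found, current)
      queue <- c(queue, parents[[current]])
    }
  }
  found
}

# Return every term that has `target` among its ancestors
descendants_of <- function(target, parents) {
  terms <- names(parents)
  keep <- vapply(terms, function(x) target %in% get_ancestors(x, parents), logical(1))
  terms[keep]
}

get_ancestors("tf_binding", parents)
descendants_of("binding", parents)
```

In this toy graph, a query for `binding` returns both `protein_binding` and `tf_binding`, which mirrors how a query for a general GO term also covers its more specialized child terms.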

#### Gene ontology relationship
In a DAG, *terms* are represented as *nodes* and the *relations* (also known as *object properties*) between *terms*
are *edges*. Commonly used relations in GO include *`is a`* (is a *subtype of*), *`part of`, `has part`, `regulates`,
`negatively regulates` and `positively regulates`*. All terms (except for the root terms representing each aspect) have a sub-class relationship to another term.

Examples:

> **Example:**
>  
> **GO:1904659:glucose transport** *`is a`* **GO:0015749:monosaccharide transport**.
>  
> The *`is a`* relation forms the basic structure of GO. If we say A *`is a`* B, we mean that node A is a subtype of node B


> **Example:**
>  
> **GO:0031966:mitochondrial membrane** *`part of`* **GO:0005740:mitochondrial envelope**
>  
> The *`part of`* relation is used to represent part-whole relationships. A *`part of`* relation would only be added between
> A and B if B is **necessarily** *`part of`* A: wherever B exists, it is as *`part of`* A, and the presence of B implies
> the presence of A. However, given the occurrence of A, we cannot say for certain that B exists.


> **Example:**
>  
> **GO:0098689:latency-replication decision** *`regulates`* **GO:0019046:release from viral latency**
>  
> A relation that describes a case in which one process directly affects the manifestation of another process or quality,
> i.e. the former *`regulates`* the latter.


A more specific case with more nodes and edges can be seen in the figure below:
<br>
![](./images/Module3/GO_Relation.jpg)

*(Source: https://advaitabio.com/)* <br>
For more technical information about relations and their properties used in GO and other ontologies, see the
<a href="https://obofoundry.org/ontology/ro.html">OBO Relations Ontology (RO)</a>


In [None]:
#Run the following command to take the quiz
IRdisplay::display_html('<iframe src="../Quizzes/Quiz_Submodule3.html" width=100% height=250></iframe>')

#### GO storage file formats
GO terms are updated monthly in the following formats:
* OBO 1.4 files are human-readable (in addition to machine-readable) and can be opened in any text editor.
* OWL files can be opened with the <a href="https://protege.stanford.edu/">Protégé</a> ontology editor.

In this learning submodule, we will only use the OBO format to obtain GO terms. The OBO file format is designed for representing ontologies and controlled vocabularies. The format itself attempts to achieve the following goals:
 * Human readability
 * Ease of parsing
 * Extensibility
 * Minimal redundancy

The file structure is shown in the following figure.


![](./images/Module3/OBO_Format.jpg)

The OBO file has a header, which is an unlabeled section at the beginning of the document. The header ends when the first term is encountered. Next, terms are represented in a labeled section with the tag *[Term]*. Under each term, we can find other information such as term ID, official name, category (namespace), term definition, synonym and relation to other GO terms.
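
The layout described above can be sketched with a few lines of base R. The stanza below is hand-written in OBO style for illustration: the `part_of` line reflects the mitochondrial membrane example used earlier, while the definition text is abbreviated and the bracketed source is a placeholder, so it should not be read as a copy of the official GO release.

```r
# Hand-written lines in OBO style (definition text abbreviated for illustration)
obo_lines <- c(
  "format-version: 1.2",
  "",
  "[Term]",
  "id: GO:0031966",
  "name: mitochondrial membrane",
  "namespace: cellular_component",
  "def: \"Either of the lipid bilayers that surround the mitochondrion.\" [illustrative]",
  "relationship: part_of GO:0005740 ! mitochondrial envelope"
)

# The header is everything before the first [Term] tag
term_start <- which(obo_lines == "[Term]")[1]
header <- obo_lines[seq_len(term_start - 1)]
stanza <- obo_lines[(term_start + 1):length(obo_lines)]

# Split each "tag: value" line at the first ": "
tags <- sub(": .*$", "", stanza)
values <- sub("^[^:]+: ", "", stanza)
term <- split(values, tags)

term$id
term$name
term$namespace
```

A real OBO file contains many `[Term]` stanzas, and tags such as `is_a` or `synonym` may repeat within one stanza; `split` keeps repeated tags together as a vector.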

At this point, we still don't know which genes are associated with which GO terms. To support retrieving gene ontology annotations for any list of genes, NCBI publishes the Gene2GO database, which maps genes (identified by NCBI Gene ID) to GO terms across organisms. The database can be downloaded from <a href="https://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz">here</a>. The Gene2GO file can be viewed in a text editor; its structure is shown in the figure below:

![](./images/Module3/Gene2GO.jpg)

The OBO and Gene2GO databases will be used in combination to obtain GO terms and related genes for enrichment analysis.
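
As a rough sketch of how such a gene-to-term table becomes a list of gene sets, the base R snippet below reads a tiny tab-delimited table and groups genes by term. The rows, gene IDs, and term IDs (`GO:TOY1`, `GO:TOY2`) are invented, and only a subset of columns is shown; the assumed column names (`tax_id`, `GeneID`, `GO_ID`, `Category`) should be checked against the header of the downloaded file.

```r
# Toy table in a Gene2GO-like layout (all values are made up)
gene2go_text <- paste(
  "tax_id\tGeneID\tGO_ID\tCategory",
  "9606\t101\tGO:TOY1\tProcess",
  "9606\t102\tGO:TOY1\tProcess",
  "9606\t102\tGO:TOY2\tProcess",
  "9606\t103\tGO:TOY2\tFunction",
  "10090\t201\tGO:TOY1\tProcess",
  sep = "\n"
)
gene2go <- read.delim(text = gene2go_text, stringsAsFactors = FALSE)

# Keep human (taxonomy ID 9606) biological process annotations only
human_bp <- gene2go[gene2go$tax_id == 9606 & gene2go$Category == "Process", ]

# Group gene IDs by GO term to obtain one gene set per term
genes_by_term <- split(as.character(human_bp$GeneID), human_bp$GO_ID)
genes_by_term
```

The resulting named list has the same shape as the gene sets saved later in this notebook: one character vector of genes per term.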

<!-- headings -->
<a id="go-retrieve"></a>
### 1.2. Retrieving GO terms
This section focuses on downloading the GO terms related to the genes in the DE results obtained in the previous submodule.
Here, we will use the `topGO` and `hgu133plus2.db` R packages to obtain GO terms. The `topGO` package has built-in functions that use gene-to-GO annotation databases to retrieve GO terms for the gene IDs given by the DE analysis.
The packages can be installed with the script below:

In [None]:
# Install the topGO, hgu133plus2.db and AnnotationDbi packages
suppressMessages({
    if (!require("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager")
    }
    suppressWarnings(BiocManager::install("topGO", update = FALSE))
    suppressWarnings(BiocManager::install("hgu133plus2.db", update = FALSE))
    suppressWarnings(BiocManager::install("AnnotationDbi", update = FALSE))
})

In [None]:
# Importing the library
suppressPackageStartupMessages({
    suppressWarnings(library("topGO"))
    suppressWarnings(library("hgu133plus2.db"))
    suppressWarnings(library("AnnotationDbi"))
})

To get the DE gene list from the DE analysis in the previous submodules, we will first need to download the data saved from the cloud bucket. To do that, we can use the following commands.

In [None]:
# Download the GSE5281.rds file to the "data" folder in the current directory
dir.create("./data", showWarnings = FALSE)
system("aws s3 cp s3://your-unique-name/GSE5281.rds ./data/GSE5281.rds")
# Load the saved data
data <- readRDS("./data/GSE5281.rds")
# Get the DE analysis results from the loaded data
limma_results <- data$limma_results

In [None]:
# Get a numeric vector with names as gene ID from DE results for retrieving GO terms
geneList <- limma_results$p.value
# Assign gene id as names for the numeric vector
names(geneList) <- rownames(limma_results)

Now, we can search for related GO terms based on the new gene list using `topGO` package. First, we need to create a `topGOdata`
object.

In [None]:
# Retrieve all the GO terms related to the gene list obtained from the expression matrix
GOdata <- new("topGOdata", description = "", ontology = "BP", 
              allGenes = geneList, geneSel = function(x) x, nodeSize = 10, 
              annot = annFUN.org, ID = "alias", mapping = "org.Hs.eg")

We can retrieve the genes annotated to each GO term using the `genesInTerm` function and view the terms with their associated genes.

In [None]:
# Obtain a list of genes for each GO term
allGO = genesInTerm(GOdata)
# show the first 5 GO terms
allGO[1:5]

We now have the GO terms with their genes. However, we still do not know what each GO term means biologically. We can use the `GO.db` package, which provides a set of annotation maps describing the entire Gene Ontology, assembled using data from GO. We can use the following code to install the `GO.db` R package.

In [None]:
# Install GO.db package
suppressMessages({
    if (!require("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager")
    }
    # Install GO.db whether or not BiocManager was already present
    suppressWarnings(BiocManager::install("GO.db", update = FALSE))
})

# Import GO.db package
library(GO.db)

Then, we can use the following commands to obtain the GO terms description.

In [None]:
# Getting the name of each GO term
terms <- names(allGO)
# Getting the description of each GO term
descriptions <- lapply(Term(terms), function(go_term) go_term[[1]])

In [None]:
descriptions[1:10]

In order to perform pathway analysis in later submodules, we need to save the GO terms using a standard format. One commonly used format is Gene Matrix Transposed file format *(\*.gmt)*. The GMT file format is a tab delimited file format that describes gene sets. In the GMT format, each row represents a gene set; in the GMX format, each column represents a gene set. Here, we can save the GO terms and their associated genes to the *\*.gmt* using the following `writeGMT` function. 
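
To illustrate the row layout just described (gene set name, description, then the genes, all tab-separated), the following base R sketch writes two toy gene sets to a temporary GMT file and reads them back. The set names, descriptions, and gene names are placeholders invented for this example.

```r
# Toy gene sets and descriptions (placeholder names)
gene_sets <- list(
  SET_A = c("GENE1", "GENE2", "GENE3"),
  SET_B = c("GENE2", "GENE4")
)
set_descriptions <- c(SET_A = "first toy set", SET_B = "second toy set")

# One GMT row per gene set: name <TAB> description <TAB> gene1 <TAB> gene2 ...
gmt_lines <- vapply(names(gene_sets), function(id) {
  paste(c(id, set_descriptions[[id]], gene_sets[[id]]), collapse = "\t")
}, character(1))

gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Read the file back: drop the first two fields to recover the genes
fields <- strsplit(readLines(gmt_file), "\t", fixed = TRUE)
read_back <- lapply(fields, function(x) x[-(1:2)])
names(read_back) <- vapply(fields, function(x) x[1], character(1))

identical(read_back, gene_sets)
```

Because rows may have different numbers of fields, a GMT file is usually read line by line as above rather than with a rectangular table reader.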

In [None]:
#' Function to save Gene Ontology (GO) terms with gene sets to local storage in GMT format.
#'
#' @param genesets A list where each element corresponds to a gene set, named by the GO term. Each element contains a character vector of gene symbols belonging to that gene set.
#' @param descriptions A list where each element corresponds to a gene set (GO term) description. It is indexed by the GO term names and contains character descriptions.
#' @param outfile The path to the output file where the GMT-formatted data will be saved.
#' @return This function is used for saving GO terms with gene sets to a GMT-formatted file.

writeGMT <- function(genesets, descriptions, outfile = "gene_sets.gmt") {
  # Check if the output file already exists, and remove it if it does
  if (file.exists(outfile)) {
    file.remove(outfile)
  }

  # Loop through each gene set (GO term) in the input genesets
  for (gs in names(genesets)) {
    # Prepare a line to write to the output file
    line <- c(gs, gsub("\t", " ", descriptions[[gs]]), genesets[[gs]])

    # Write the line to the output file in GMT format
    # The line contains GO term, description, and the associated gene symbols
    write(line, file = outfile, sep = "\t", append = TRUE, ncolumns = length(genesets[[gs]]) + 2)
  }
}


# Specify the output file path
outfile <- "./data/GO_terms.gmt"
# Call the writeGMT function to save GO terms with genesets
writeGMT(genesets = allGO, descriptions = descriptions, outfile = outfile)

In [None]:
# save the GO terms with genesets to Amazon S3 Bucket
# replace <BUCKET_NAME> with name of your bucket that was previously made in submodule 1
# system("aws s3 cp ./data/GO_terms.gmt s3://<BUCKET_NAME>", intern = TRUE)
system("aws s3 cp ./data/GO_terms.gmt s3://your-unique-name", intern = TRUE)

<!-- headings -->
<a id="kegg-main"></a>
## 2. Kyoto Encyclopedia of Genes and Genomes (KEGG)
<!-- headings -->
<a id="kegg-overview"></a>
### 2.1. Overview
KEGG is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. KEGG is used for bioinformatics research and education, including data analysis in genomics, metagenomics, metabolomics and other omics studies, modeling and simulation in systems biology, and translational research in drug development.

The KEGG database project was initiated in 1995 by Minoru Kanehisa, professor at the Institute for Chemical Research, Kyoto University, under the then ongoing Japanese Human Genome Program. Foreseeing the need for a computerized resource for the biological interpretation of genome sequence data, he started developing the KEGG PATHWAY database. It is a collection of manually drawn KEGG pathway maps representing experimental knowledge on metabolism and various other functions of the cell and the organism. Each pathway map contains a network of molecular interactions and reactions and is designed to link genes in the genome to gene products (mostly proteins) in the pathway. This has enabled the analysis called KEGG pathway mapping, whereby the gene content of a genome is compared with the KEGG PATHWAY database to examine which pathways and associated functions are likely to be encoded in the genome.

KEGG is a "computer representation" of the biological system. It integrates the building blocks and wiring diagrams of the system: genetic building blocks of genes and proteins, chemical building blocks of small molecules and reactions, and wiring diagrams of molecular interaction and reaction networks. The structure of KEGG is illustrated in the figure below.
![](./images/Module3/KEGG.jpg)


In [None]:
#Run the following command to take the quiz
IRdisplay::display_html('<iframe src="../Quizzes/Quiz_Submodule3-2.html" width=100% height=250></iframe>')

<!-- headings -->
<a id="kegg-retrieve"></a>
### 2.2. Retrieving pathways from KEGG database
In this section, we will retrieve pathways and their related gene sets from the KEGG database using R commands. Here we will use the `KEGGREST` R package, which provides a client interface to the KEGG REST server. `KEGGREST` can be installed from Bioconductor using the following command.

In [None]:
# Install KEGGREST package
suppressMessages({
    if (!require("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager")
    }
    suppressWarnings(BiocManager::install("KEGGREST", update = F))
})

In [None]:
# Import KEGGREST package
suppressPackageStartupMessages({
    library(KEGGREST)
})

KEGG contains a number of databases. To get an idea of what is available, run `listDatabases()`:

In [None]:
# List all available databases in KEGGREST
KEGGREST::listDatabases()

We can use these databases in further queries. Note that in many cases you can also use a three-letter KEGG organism code or a “T number” (genome identifier) in the same place you would use one of these database names.

We can obtain the list of organisms available in KEGG with the `keggList()` function:

In [None]:
# Get the list of organisms available in KEGG
organism <- keggList("organism")

In [None]:
print(paste0("KEGG supports ", dim(organism)[1], " organisms"))

To view the supported organisms we can use the following command:

In [None]:
# View the first few supported organisms
head(organism)

In [None]:
IRdisplay::display_html('<iframe src="../Quizzes/Quiz_Submodule3-3.html" width=100% height=250></iframe>')

In submodule 02, we performed DE analysis on a human dataset. Therefore, we need to download the human pathways. The KEGG organism code for human is `hsa`, and we can use the `keggList` function to get the pathway list.

In [None]:
# Obtain the pathways belonging to human
pathways.list <- keggList("pathway", "hsa")

The pathway list is a named character vector: the name of each element is the pathway code and its value is the pathway description. To see the first five pathways related to **carbohydrate metabolism**, we will use the commands below.

In [None]:
#list the specific pathways to view
pathway_ids <- c("hsa00010", "hsa00020", "hsa00030", "hsa00040", "hsa00051")
#view their names
pathways.list[pathway_ids]

To learn more about these pathways you can visit https://www.genome.jp/kegg/pathway.html. 

In [None]:
#Run the following command to take the quiz
IRdisplay::display_html('<iframe src="../Quizzes/Quiz_Submodule3-4.html" width=100% height=250></iframe>')

In the output, the quoted text is the pathway description, while the name of each element is the pathway code. Depending on the KEGG release, the code may be preceded by the prefix `path:`. To get the bare pathway codes from the pathway list, we can use the following commands:

In [None]:
# Retrieve all the pathway IDs belonging to human
pathway.codes <- sub("path:", "", names(pathways.list))
pathway.codes

We can use the following command to check how many pathways are available for human.

In [None]:
print(paste0("Number of available pathways for human are: ", length(pathway.codes)))

We will use the code below to obtain a list of genes and pathway descriptions for all available human pathways. We'll utilize the `keggGet` function for this task. Since this function can provide 10 results at a time, we will pass 10 pathway IDs as input at each step. 
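
The batching idea can be tried without contacting KEGG. The base R sketch below splits a vector of made-up pathway codes into consecutive groups of at most ten; `toy_codes` is a stand-in for the real vector of pathway codes and the codes themselves are not real KEGG identifiers.

```r
# Made-up pathway codes standing in for the real ones
toy_codes <- sprintf("hsa%05d", 1:23)

# Assign each code to a batch number (1, 1, ..., 2, 2, ...) and split on it
batch_id <- ceiling(seq_along(toy_codes) / 10)
chunks <- split(toy_codes, batch_id)

# Number of codes in each batch
lengths(chunks)
```

Each element of `chunks` could then be passed to a single query, so that no request exceeds the ten-identifier limit and no code is queried twice.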

Note that KEGG also provides overview pathways that group many sub-pathways. When we look these up through `keggGet`, the returned entry may not list any genes. Such empty entries could skew our results, so the loop below skips any pathway that does not have a `GENE` section.
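
The sketch below mimics that filtering and the gene-symbol extraction on two hand-made entries, so it can be run offline. It assumes, as the loop in this notebook does, that the `GENE` field alternates gene IDs with strings of the form `SYMBOL; description`. The entries and their names are illustrative and not actual `keggGet` output.

```r
# Hand-made entries shaped like simplified KEGG pathway records
toy_entries <- list(
  list(NAME = "Toy overview map"),  # no GENE section, should be skipped
  list(NAME = "Toy pathway",
       GENE = c("3101", "HK3; hexokinase 3", "3098", "HK1; hexokinase 1"))
)

symbols_by_entry <- list()
for (entry in toy_entries) {
  # Skip entries without a GENE section
  if (is.null(entry$GENE)) next
  # Keep every second element: the "SYMBOL; description" strings
  gene_field <- entry$GENE[c(FALSE, TRUE)]
  # Keep only the text before the first ";"
  symbols <- vapply(strsplit(gene_field, ";", fixed = TRUE),
                    function(x) x[1], character(1))
  symbols_by_entry[[entry$NAME]] <- symbols
}

symbols_by_entry
```

Only the entry with a `GENE` section survives, and its gene set contains the symbols alone.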

In [None]:
# keggGet accepts at most 10 identifiers per request,
# so split the pathway codes into consecutive chunks of up to 10
chunks <- split(pathway.codes, ceiling(seq_along(pathway.codes) / 10))
# Create lists to store the gene sets and the pathway descriptions
genes.by.pathway <- list()
description.by.pathway <- list()

# Loop through each chunk of pathway codes
for (pathways_names in chunks) {
    # Query up to ten pathways at once using keggGet
    pw <- keggGet(pathways_names)
    for (j in seq_along(pw)) {
        pw2 <- pw[[j]]
        # If the returned result of a pathway does not have the key "GENE", skip to the next pathway id
        if (is.null(pw2$GENE)) next
        description.by.pathway[[pathways_names[[j]]]] <- pw2$NAME

        # GENE alternates gene IDs and "SYMBOL; description" strings; keep the latter
        pw2 <- pw2$GENE[c(FALSE, TRUE)]
        # Keep only the gene symbol before the first ";"
        pw2 <- unlist(lapply(strsplit(pw2, split = ";", fixed = TRUE), function(x) x[1]))
        genes.by.pathway[[pathways_names[[j]]]] <- pw2
    }
}
                                 

We can view the first five pathways with their genesets using the following command

In [None]:
# View the five pathways with the genesets
genes.by.pathway[1:5]

Use the following command to see the description of the first five pathways

In [None]:
# View the description of the first five pathways
description.by.pathway[names(genes.by.pathway[1:5])]

Pathways without a gene list (such as the first few overview pathways) were already skipped by the loop above, so we can save the remaining pathways directly.

In [None]:
# Saving the pathway data to the local repository
outfile <- "./data/KEGG_pathways.gmt"
writeGMT(genesets = genes.by.pathway, descriptions = description.by.pathway, outfile = outfile)

In [None]:
# Saving the pathway information to the Amazon S3 Bucket
# replace <BUCKET_NAME> with name of your bucket that was previously made in submodule 1
# system("aws s3 cp ./data/KEGG_pathways.gmt s3://<BUCKET_NAME>", intern = TRUE)
system("aws s3 cp ./data/KEGG_pathways.gmt s3://your-unique-name", intern = TRUE)

<!-- headings -->
<a id="reactome-main"></a>
## 3. REACTOME Pathway Database
<!-- headings -->
<a id="reactome-overview"></a>
### 3.1. Overview

REACTOME is an open-source, open-access pathway database that provides a valuable bioinformatics resource for the visualization, interpretation and analysis of biological pathways. Established in 2003, it is a collaborative project led by Lincoln Stein of OICR, Peter D’Eustachio of NYU Langone Health, Henning Hermjakob of EMBL-EBI, and Guanming Wu of OHSU.

REACTOME focuses on reactions as its core unit: entities such as nucleic acids, proteins and small molecules take part in reactions that form dynamic networks of interactions and are grouped into pathways. The pathways are species-specific, and each step undergoes rigorous experimental verification to ensure reliability. This curation process mirrors the editing of a scientific review, which requires the participation of multiple domain experts.

REACTOME annotates human biological processes by breaking them down into molecular events that resemble classical chemical reactions. Each REACTOME event involves input entities interacting to produce output entities, and covers reactions such as metabolic conversions, binding events, complex formation, transport events, and protein activation. Events are organized into pathways, and physical entities, whether small (e.g., glucose) or large (e.g., DNA), are cross-referenced to external databases. Because subcellular localization is crucial to the regulation of human biological processes, molecules in the REACTOME database are associated with particular locations, and REACTOME treats instances of the same molecule in different locations as distinct entities. The database uses Gene Ontology to control the vocabularies used to describe the subcellular locations of molecules and reactions, molecular functions, and the larger biological processes that a specific reaction is part of.

REACTOME is broadly used by clinicians, geneticists, genomics researchers, and molecular biologists to interpret the findings of high-throughput experimental studies. It is also employed by computational biologists seeking to develop novel algorithms for leveraging knowledge from genetic research. Additionally, systems biologists use REACTOME to construct predictive models of normal and pathological pathways.

<!-- headings -->
<a id="reactome-retrieve"></a>
### 3.2. Retrieving pathways from REACTOME database

In this section, we'll use the `ReactomeContentService4R` package to fetch pathways and their corresponding genes. This R package offers a convenient way to interact with the Reactome Content Service API. Its predefined functions enable users to access information and images containing proteins, pathways, and other molecules associated with a particular gene or entity in Reactome.

We first need to install and import the `ReactomeContentService4R` package. If you encounter an error while installing this package in the notebook, you can install it from the terminal instead. The package depends on the `magick` R package, which in turn requires the system library `libmagick++-dev`. We can install this library either from the terminal with ```sudo apt-get install -y libmagick++-dev``` or from inside the notebook with the `system` function. To open a terminal in JupyterLab, click File -> New -> Terminal. Here, we will use the `system` function as follows:

**Note**: Please be aware that the installation instructions provided here might change in the future with the release of new versions of the software or packages. It is advisable to check for the most up-to-date installation instructions on [CRAN official document page](https://cran.r-project.org/web/packages/magick/vignettes/intro.html). This ensures that you have the latest and most accurate information for installing the package. 


In [None]:
# Install the libmagick++-dev system library required by the magick R package
system("sudo apt-get install -y libmagick++-dev", intern = TRUE, ignore.stdout = TRUE)

In [None]:
# Install and load required packages with suppressWarnings and suppressMessages

suppressWarnings({
    suppressMessages({
        if (!require("BiocManager", quietly = TRUE))
            install.packages("BiocManager")

        BiocManager::install("ReactomeContentService4R", force = TRUE)
    })
})

# Import the ReactomeContentService4R package
library("ReactomeContentService4R")

To retrieve information on the pathways of a specific species, we can use the `getSchemaClass` function, which fetches instances by schema class. It takes the following arguments:

- `class`: A character string specifying the schema class name.
- `species`: A character string specifying the name, taxon id, dbId, or abbreviation of a species. Only the Event and PhysicalEntity classes can specify a species.
- `all`: A boolean that decides whether to return ALL entries; the default is FALSE.

This function returns a sorted dataframe containing entries that belong to the specified schema class.

In the following code, we are getting all human pathways information.

In [None]:
# Retrieve the pathways information using the getSchemaClass function
pathways <- getSchemaClass(class = "Pathway", species = "human", all = TRUE)
# Show the first five rows of the returned data frame
head(pathways, 5)

We will select only the pathway ID and their associated name column from the information dataframe.

In [None]:
# Get the pathway ID and the pathway description from the pathways information
reactome_pathways <- pathways[, c("stId", "displayName")]
# Show some first row of the result table
head(reactome_pathways)

Next, we will get the gene IDs for each pathway and store them in a list, just as we did for KEGG and the GO terms. To do that, we can use the code below.

In [None]:
# Create the geneset list 
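# simplify = FALSE guarantees a list named by pathway ID, even if all gene sets have equal length
rgenes.by.pathway <- sapply(reactome_pathways$stId, function (pathwayID) {
    genesID <- event2Ids(event.id = pathwayID)
    geneSymbol <- genesID[["geneSymbol"]]
    geneSymbol
}, simplify = FALSE)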

# Get the description of the pathways list
rdescription.by.pathway <- as.list(reactome_pathways$displayName)
names(rdescription.by.pathway) <- reactome_pathways$stId

In [None]:
# Print out some first pathways and their genes IDs
rgenes.by.pathway[1:5]

In [None]:
# Print out some first pathways and their descriptions
rdescription.by.pathway[1:10]

Now, we can save the REACTOME pathways to a GMT file for later use with the `writeGMT` function.

In [None]:
# Saving the pathway information to the local repository
outfile <- "./data/REACTOME_pathways.gmt"
writeGMT(genesets = rgenes.by.pathway, descriptions = rdescription.by.pathway, outfile = outfile)


In [None]:
# Saving the pathway information to the Amazon S3 Bucket
# replace <BUCKET_NAME> with name of your bucket that was previously made in submodule 1
# system("aws s3 cp ./data/REACTOME_pathways.gmt s3://<BUCKET_NAME>", intern = TRUE)
system("aws s3 cp ./data/REACTOME_pathways.gmt s3://your-unique-name", intern = TRUE)

In the next submodule, we will do Pathway Analysis.


## Conclusion

This notebook provided a guide to retrieving and processing pathway information from three major biological databases: Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and REACTOME. We explored the structure and organization of each database, learned how to access and query them using R, and formatted the resulting pathway gene sets into `.gmt` files for downstream analysis. These curated gene sets, now stored locally and in an Amazon S3 bucket, are ready to be used for pathway enrichment analysis in the next submodule, where they will help us interpret the biological significance of the differentially expressed genes identified in previous analyses. Remember that these databases are regularly updated; consider refreshing your downloaded data periodically to use the most current pathway annotations.

## Clean Up

Remember to move to the next notebook or shut down your instance if you are finished.

---

In [None]:
sessionInfo()