# Processing Expression Data

![](./images/Module1/Data_Processing.png)

## Overview

This Jupyter Notebook covers the processing of gene expression data, focusing on datasets from the NCBI Gene Expression Omnibus (GEO) database. It details how to browse, download, and process data from GEO using both the web interface and the `GEOquery` R package, with GSE5281 (microarray) and GSE153873 (RNA-Seq) as examples. The notebook also explains how to upload user-provided data, either directly to the notebook instance or to an Amazon S3 bucket. The data processing steps, including normalization, sample condition extraction, and gene ID mapping, are outlined with separate code examples for microarray and RNA-Seq data. Finally, it describes how to store processed data in an Amazon S3 bucket and export data in CSV and RDS formats.

## Learning Objectives:
1. Explore the data available from the Gene Expression Omnibus (GEO) database.
2. Use the web interface and the R command line to download data from GEO.
3. Use the web interface to upload experiment data to the cloud environment.
4. Learn data processing and gene mapping.
5. Store data in an Amazon S3 bucket and export data in CSV or RDS format.

## Prerequisites

* **R:**  The core requirement.
* **R Packages:**  The notebook explicitly installs or uses the following:
    * `GEOquery` (from Bioconductor)
    * `BiocManager` (for managing Bioconductor packages)
    * `IRdisplay` (for displaying the HTML quizzes embedded in the notebook)
    * `hgu133plus2.db` (annotation package, specific to the GPL570 microarray platform)
    * `org.Hs.eg.db` (annotation package for human gene mappings)
    * `readr` (optional; the export step in this notebook uses base R's `write.csv`)

## Get Started

### Table of Contents
1. [Browsing and Downloading from NCBI GEO](#dt-query)
    - 1.1. [Downloading using Web Interface](#dt-webquery)
    - 1.2. [Downloading using R Command Line](#dt-Rquery)
2. [Uploading User-Provided Data](#dt-manual)
3. [Data Processing and Gene Mapping](#dt-process)
    - 3.1. [Example of Microarray Data](#dp-microarray)
    - 3.2. [Example of RNA-Seq data](#dp-rnaseq)
4. [Storing Data to Amazon S3 Storage](#dt-cloud)
5. [Exporting Data](#dt-export)


<!-- headings -->
<a id="dt-query"></a>
## 1. Browsing and Downloading from NCBI GEO

The Gene Expression Omnibus (GEO) is a public repository that accumulates and serves gene expression data, such as microarray, next-generation sequencing, and other forms of high-throughput functional genomic data, from thousands of studies submitted by the scientific community. 
The data come with written descriptions of experimental design, sample characteristics, and methodology for studies of high-throughput gene expression and genomics. 

<!-- headings -->
<a id="dt-webquery"></a>
### 1.1. Downloading using Web Interface
Browsing the content on the GEO website is user-friendly and relatively straightforward. First, users need to navigate to <a href="https://www.ncbi.nlm.nih.gov/geo/"> https://www.ncbi.nlm.nih.gov/geo/</a>. The GEO website interface is shown in the figure below:

![](./images/Module1/GEO_Website.png)

The most straightforward way to browse is to click one of the items under the `Browse Content` column.
For example, clicking `Series` opens the following web page:

![](./images/Module1/GEO_Website_Screening.png)

The figure shows the list of data series available on GEO, along with basic information about each series: accession ID, title, type of sequencing platform, organism, number of samples, referenced dataset ID, list of supplementary files, contact person, and release date. Of note, the `Supplementary` column also indicates whether raw data are available for the series. We can apply filters to these columns to find the datasets that match our research. For a data series of interest, we can click on the accession number to explore the data further.

Alternatively, if we know the accession number, we can provide it to the search box on the homepage. Now, we will search for the two datasets that we are going to use in the learning module.
The GEO website interface with the searching procedure of the example dataset is shown in the figure below:

![](./images/Module1/GEO_Website_Searchbox.png)

When the search completes, a page opens with a detailed record of the example dataset, including publication date, title, organism, experiment type, and dataset summary, as shown in the figure below:

![](./images/Module1/GEO_Dataset_Page.png)

At the bottom of the dataset page, users will find additional information about the dataset, such as the sequencing platform, number of samples, project ID, and links to download the expression data. Users can click the `(http)` hyperlink to download all the samples or click `(custom)` to select and download the samples of interest. Note that the expression data downloaded at this step may be raw data, in which case additional processing needs to be done locally before further analysis.

To display the quizzes in the learning sub-modules, the `IRdisplay` package must be installed.
This package allows quizzes written in HTML to be rendered in the notebook. Users can install `IRdisplay` using the following commands:
```
suppressWarnings(if (!require("IRdisplay")) install.packages("IRdisplay"))
suppressWarnings(library(IRdisplay))
```

In [None]:
# Run the following command to take the quiz
IRdisplay::display_html('<iframe src="./Quizzes/Quiz_Submodule1.html" width=100% height=250></iframe>')

<!-- headings -->
<a id="dt-Rquery"></a>
### 1.2. Downloading using R Command Line
Users can also use R to query expression data from GEO with an R package built specifically for this database. In this section, we will download and process two Alzheimer's datasets from GEO, with accession numbers GSE5281 (microarray) and GSE153873 (RNA-Seq), using the `GEOquery` R package. For other databases, we suggest that users look for dedicated packages in R repositories such as CRAN or Bioconductor.

Before starting, users will need to install the `GEOquery` package using the following command.

In [None]:
# Install the required packages
suppressMessages({
    if (!require("BiocManager", quietly = TRUE)) {
        suppressWarnings(install.packages("BiocManager"))
    }
    suppressWarnings(BiocManager::install("GEOquery", update = FALSE))
})

# Load the package to confirm that it is installed
suppressMessages(library("GEOquery"))

#### Download Microarray Dataset: GSE5281
We can use the `getGEO` function from the `GEOquery` package to download a GEO dataset. First, users have to specify the accession ID of the dataset. For this demonstration, we will use the `GSE5281` dataset introduced above.

In [None]:
# Specify GEO accession ID
accession_ID <- "GSE5281"


# Specify directory to save the data
save_Path <- "./data"

# Create the data folder
dir.create(save_Path, recursive = TRUE, showWarnings = FALSE)

# Download the data
suppressMessages({
    gse <- getGEO(GEO = accession_ID, destdir = save_Path)
})

To use the `getGEO` function, you need to pass the following arguments:

- `GEO`: A character string representing the GEO accession ID
- `destdir`: A character string representing the destination directory to save the downloaded data.

The `getGEO` function returns a list of `ExpressionSet` objects. The list can contain more than one object because some datasets on GEO are derived from several microarray platforms, and each object in the returned list holds the data generated on one particular platform.
We can find out how many platforms were used by checking the length of the `gse` object.

In [None]:
# Check how many platforms used
message(paste0("Number of platforms: ", length(gse)))

The result shows that the list contains a single dataset, which corresponds to the microarray platform mentioned on the GEO dataset page.
Next, we can access the gene expression matrix and the sample information using the dedicated accessor functions as follows:

In [None]:
# Extract the dataset from the gse object
data <- gse[[1]]

# Access to the gene expression matrix
GSE5281Exprs <- exprs(data)

# Access to the samples information
GSE5281Samples <- pData(data)

# Check the number of samples and genes
print(paste0("The dataset contains ", dim(GSE5281Exprs)[2] , " samples and ", dim(GSE5281Exprs)[1], " genes"))

We can check the data tables we have just accessed by specifying the indexes of rows and columns as follows:

In [None]:
message("The example of expression matrix")
GSE5281Exprs[1:10, 1:10]

In [None]:
message("Example of sample information table")
GSE5281Samples[1:10, 1:10]

`GSE5281Samples` contains the metadata of each sample, such as title, status, GEO accession, and submission date.
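
When a sample table has many `characteristics_*` columns, it can help to search for the column that holds a given attribute. The sketch below runs on a small made-up table; the column contents and the keyword `"disease state"` are assumptions for illustration and are not taken from GSE5281.

```r
# Toy sample table mimicking the layout of a GEO sample table (made-up values)
toy_samples <- data.frame(
    title = c("EC_control_1", "EC_affected_1", "HIP_control_1"),
    characteristics_ch1 = c("organ region: entorhinal cortex",
                            "organ region: entorhinal cortex",
                            "organ region: hippocampus"),
    characteristics_ch1.1 = c("disease state: normal",
                              "disease state: Alzheimer's disease",
                              "disease state: normal"),
    stringsAsFactors = FALSE
)

# For each column, check whether any entry mentions the keyword
has_keyword <- vapply(toy_samples,
                      function(column) any(grepl("disease state", column, fixed = TRUE)),
                      logical(1))
matching_columns <- names(has_keyword)[has_keyword]
matching_columns

# Count the distinct values in the matching column
table(toy_samples[[matching_columns[1]]])
```

The same pattern can be tried on `GSE5281Samples` with a keyword that fits the attribute of interest.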

In [None]:
# Run the following command to take the quiz
IRdisplay::display_html('<iframe src="./Quizzes/Quiz_Submodule1-1.html" width=100% height=250></iframe>')

#### Download RNA-Seq Dataset: GSE153873

Microarray data deposited on GEO are usually processed by the authors. In contrast, RNA-Seq data deposited on GEO are typically read count matrices without any normalization. Therefore, the expression values cannot be taken from the `Series Matrix File(s)`. Instead, the count matrix is stored as a **Supplementary File**. To download it, we will use the function `getGEOSuppFiles`.

In [None]:
# Specify GEO accession ID
accession_ID <- "GSE153873"

# Specify directory to save the data
save_Path <- "./data/GSE153873"

# Download supplementary files
tmp <- getGEOSuppFiles(GEO = accession_ID, baseDir = "./data", fetch_files = TRUE)

# Check files in the directory
list.files(save_Path)

In this code snippet, we use the function `getGEOSuppFiles()` to download the supplementary files of the dataset GSE153873. The function has the following parameters:
- `GEO` – a character string that specifies the GEO accession number,
- `baseDir` – a character string that specifies the base directory for the downloaded data (by default, the files are placed in a sub-folder named after the accession number), and
- `fetch_files` – a logical value: `TRUE` tells the function to actually download the files, while `FALSE` tells it to only return the names of the files that would be downloaded.

The function returns a data frame whose row names are the full paths to the downloaded files. We can check the downloaded files in this folder using the function `list.files()`. As we can see in the output, the raw count data is saved under the name `GSE153873_summary_count.star.txt.gz`. To load the count matrix, users need to run the following commands:

In [None]:
# Get the path to the count matrix file:
countsFile <- file.path(save_Path, "GSE153873_summary_count.star.txt.gz")

# Read the count matrix file (fill = FALSE: rows with missing fields raise an error):
GSE153873Exprs <- read.table(countsFile, header = TRUE, sep = "\t", fill = FALSE, row.names = 1, check.names = FALSE)

# Examine GSE153873Exprs:
message("Examine the read count matrix")
GSE153873Exprs[1:10, 1:10]
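
To see how these `read.table` arguments behave without downloading anything, the following sketch writes a tiny gzipped, tab-separated table to a temporary file and reads it back. The gene names and counts are made up, and `toy_counts`, `toy_file`, and `toy_read` are hypothetical names used only here.

```r
# Toy stand-in for a count matrix (hypothetical values, not GSE153873 data)
toy_counts <- data.frame(
    S1 = c(10L, 0L, 25L),
    S2 = c(12L, 3L, 30L),
    row.names = c("GENE_A", "GENE_B", "GENE_C")
)

# Write it as a gzipped, tab-separated file; col.names = NA adds an empty
# header cell above the row names so the header has as many fields as the rows
toy_file <- tempfile(fileext = ".txt.gz")
con <- gzfile(toy_file, open = "wt")
write.table(toy_counts, con, sep = "\t", quote = FALSE, col.names = NA)
close(con)

# read.table() reads gzip-compressed files transparently;
# row.names = 1 turns the first column into row names
toy_read <- read.table(toy_file, header = TRUE, sep = "\t",
                       row.names = 1, check.names = FALSE)

dim(toy_read)
rownames(toy_read)
```

Setting `check.names = FALSE` keeps the sample names exactly as they appear in the file, which matters later when they are matched against the sample information table.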

As we can see, the count matrix file has rows as genes and columns as samples. Similar to the microarray dataset, we use the `getGEO` function to get the sample information as follows:

In [None]:
# Download the dataset
suppressMessages({
    gse <- getGEO(GEO = accession_ID, destdir = save_Path)
})

# Extract the dataset from the gse list:
data <- gse[[1]]

In [None]:
# Access to the sample information table:
GSE153873Samples <- pData(data)

# Examine GSE153873Samples
message("Examine the sample information")
GSE153873Samples[1:10, 1:10]

<!-- headings -->
<a id="dt-manual"></a>
## 2.  Uploading User-Provided Data

Users can also directly upload their own data, or data downloaded from public databases such as The Cancer Genome Atlas (TCGA) on the GDC portal or ArrayExpress, to the cloud environment. However, it is important to note that the submodule is designed to handle gene expression data generated from either microarray or RNA-Seq experiments. For those starting with raw data files (.CEL for microarray or .FASTQ for RNA-Seq), we recommend consulting relevant protocols for alignment and for obtaining the expression data table. Accordingly, this section requires the following file types as input:

- A gene expression/read count matrix file. The matrix file is a table that contains the gene expression/read counts for each gene in each sample. The matrix can be saved in any format, e.g., TXT, CSV, TSV, etc., depending on the data processing pipeline used to generate it. The rows in the matrix are feature IDs (e.g., probe IDs, Ensembl IDs, etc.) while the columns are samples. An example of a gene expression matrix is shown below, in which columns are sample IDs and rows are Ensembl IDs:

![](./images/Module1/User_Example_Exprs.png)

-  A spreadsheet containing sample information, in CSV or TSV format. In this spreadsheet, each row represents a sample and each column represents one of its attributes, e.g., sample ID, vital status, tissue, platform, etc. An example of this spreadsheet is shown below, in which users can use `vital_status` as the sample condition for differential analysis:

![](./images/Module1/User_Example_Sample.png)
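
Before any analysis, it is worth checking that the two files describe the same samples. The sketch below is a minimal example on made-up tables; the column name `sample_id` and all values are assumptions, so adapt them to the actual files.

```r
# Hypothetical expression matrix: rows are feature IDs, columns are samples
user_exprs <- data.frame(
    sample_1 = c(5.1, 7.3),
    sample_2 = c(4.8, 7.9),
    sample_3 = c(6.0, 6.5),
    row.names = c("ENSG00000000003", "ENSG00000000005")
)

# Hypothetical sample sheet: one row per sample, deliberately in a different order
user_samples <- data.frame(
    sample_id = c("sample_3", "sample_1", "sample_2"),
    vital_status = c("Dead", "Alive", "Alive"),
    stringsAsFactors = FALSE
)

# Every sample in the matrix should be described exactly once in the sheet
stopifnot(setequal(colnames(user_exprs), user_samples$sample_id),
          !anyDuplicated(user_samples$sample_id))

# Reorder the sample sheet so that its rows follow the matrix columns
user_samples <- user_samples[match(colnames(user_exprs), user_samples$sample_id), ]
rownames(user_samples) <- user_samples$sample_id
user_samples
```

Failing early with `stopifnot()` is a design choice here: a silent mismatch between samples and conditions would otherwise only surface as wrong results downstream.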


### 2.1. Upload to Notebook Instance
Users can upload the data directly to the cloud by simply using the user interface of this Jupyter Notebook. The instructions are shown in the following figure:
![](./images/Module1/Data_Uploading_VAI.png)

### 2.2. Upload to Cloud Storage Bucket
Alternatively, users can upload their data to an AWS S3 bucket. Data stored on the notebook instance may be lost once the instance is deleted, so keeping a copy in an S3 bucket lets users access the data whenever they need it. The steps to upload data to an S3 bucket are as follows:
1. On the webpage of AWS account, search for [S3 service](https://console.aws.amazon.com/s3).
2. Select `Create bucket` button to create a new bucket.

![](./images/Bucket/bucket1.png)

3. Enter the information required to create the bucket, such as a unique name and region. Note that users can define the access control in this step and edit it later if they want to share their data.

![](./images/Bucket/bucket2.png)

4. Click `Create bucket` once all the required information is provided.

![](./images/Bucket/bucket3.png)

5. The dashboard shows all the buckets users created. Click on the bucket's name:

![](./images/Bucket/bucket4.png)

6. Click on *Upload*:

![](./images/Bucket/bucket5.png)

7. Start adding your files to the bucket:

![](./images/Bucket/bucket6.png)

Once the upload is done, users can copy the data to their instance by running a command of the following form in an R code cell: `system("aws s3 cp s3://<BUCKET-NAME>/<FILE-NAME> <DESTINATION>")`. For example, the following commands copy the GSE5281 files stored in the S3 bucket:

```
# Download the files from S3 Bucket to the "data" folder in current directory
system("aws s3 cp s3://your-unique-name/GSE5281.csv ./data/")
system("aws s3 cp s3://your-unique-name/GSE5281_SampleInfo.csv ./data/")
```

<!-- headings -->
<a id="dt-process"></a>
## 3. Data Processing and Gene Mapping

Once the data is downloaded, we need to perform data processing to prepare the data for differential analysis and pathway analysis. The data processing includes the following steps: (i) Data normalization, (ii) Sample condition extraction, and (iii) Gene IDs mapping.

<!-- headings -->
<a id="dp-microarray"></a>
### 3.1. Example of Microarray Dataset: GSE5281

#### Data normalization

Data normalization is typically performed for quality assurance. However, different downstream analysis methods may require different normalization techniques, so we suggest that users consult the documentation of these methods to choose the most appropriate one. For example, if a method requires the expression values to be on a log scale, we can check the scale of the data as follows:

In [None]:
# Show a summary of the expression data using the summary function
summary(GSE5281Exprs[, 1:5])

In [None]:
# Show value range of the expression data using the range function
range(GSE5281Exprs)

From the summary above, we can see that the maximum expression values are in the thousands, while the average expression values in each sample are below one. One common step is to compute the quantiles of the data, ignoring missing values, and use them to decide whether a log transformation is needed. The transformation helps make the distributions of all samples comparable. A `boxplot` can then be generated to check whether the data have been properly normalized. The sample code below performs all of these steps.

In [None]:
# Calculate the quantiles of the data, ignoring NA values
qx <- as.numeric(quantile(GSE5281Exprs, c(0, 0.25, 0.5, 0.75, 0.99, 1.0), na.rm = TRUE))
# Define the boolean LogC to decide whether or not to perform the log transformation:
# LogC is TRUE if the 99th percentile is above 100, or if the range is above 50
# and the 25th percentile is above 0
LogC <- (qx[5] > 100) ||
    (qx[6] - qx[1] > 50 && qx[2] > 0)
# If LogC is TRUE, replace non-positive values with NaN and apply the log2(x + 1) transformation
if (LogC) {
    GSE5281Exprs[which(GSE5281Exprs <= 0)] <- NaN
    GSE5281Exprs <- log2(GSE5281Exprs + 1)
}
# Draw a boxplot of the first 10 samples
boxplot(x = GSE5281Exprs[, 1:10], outline = FALSE)
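
To build some intuition for the quantile-based check, the sketch below wraps it in a small helper and applies it to simulated data. `needs_log2` is a hypothetical name introduced here, and the simulated intensities are an assumption chosen only to contrast raw-scale and log-scale values.

```r
# Hypothetical helper wrapping the quantile-based check used above
needs_log2 <- function(x) {
    qx <- as.numeric(quantile(x, c(0, 0.25, 0.5, 0.75, 0.99, 1), na.rm = TRUE))
    (qx[5] > 100) || (qx[6] - qx[1] > 50 && qx[2] > 0)
}

set.seed(1)
# Simulated raw-scale intensities (log-normal, spanning several orders of magnitude)
raw_scale <- matrix(2^rnorm(200, mean = 8, sd = 2), nrow = 50)
# The same values after a log2(x + 1) transformation
log_scale <- log2(raw_scale + 1)

needs_log2(raw_scale)
needs_log2(log_scale)
```

With these simulated values, the check flags the raw-scale matrix and leaves the transformed one alone, so running the cell twice does not transform the data twice.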

In [None]:
# Run the following command to take the quiz
IRdisplay::display_html('<iframe src="./Quizzes/Quiz_Submodule1-2.html" width=100% height=250></iframe>')

#### Samples Condition Extraction
In pathway analysis, it is crucial to determine which sample conditions are being compared to one another. In our example, we are comparing gene expression between two patient groups: *normal* - *control* (c) and *Alzheimer’s* - *disease* (d). To this end, users can execute the following code snippet:

In [None]:
# Add a column specifying the condition of each sample (normal - c or Alzheimer's - d)
GSE5281Samples$condition <- ifelse(grepl("normal", GSE5281Samples$characteristics_ch1.8), "c", "d")

# Factorize the new column
GSE5281Samples$condition <- factor(GSE5281Samples$condition)

# Add a new column to specify the region of the sample tissue,
# use make.names() to remove special characters and
# use tolower() to make all characters lowercase
GSE5281Samples$region <- make.names(GSE5281Samples$characteristics_ch1.4)
GSE5281Samples$region <- tolower(GSE5281Samples$region)

# Factorize the newly added column
GSE5281Samples$region <- factor(GSE5281Samples$region)

# Reorder the samples to match the samples order in the expression data
GSE5281Samples <- GSE5281Samples[order(match(rownames(GSE5281Samples), colnames(GSE5281Exprs))), ]

The samples of the GSE5281 dataset fall into two conditions: *normal* and *Alzheimer’s,* which are specified in the `characteristics_ch1.8` column. Each sample is also associated with a specific brain region, such as the entorhinal cortex, hippocampus, primary visual cortex, and so on, denoted in the `characteristics_ch1.4` column. Consequently, both attributes serve as conditions to determine the groups of patients.

The initial step in the code snippet adds two new columns to the sample information, representing the sample’s condition and the associated brain region. These new columns are essentially cleaner versions of the original `characteristics_ch1.8` and `characteristics_ch1.4` columns. The original columns are often manually curated and may contain special characters or duplicated data, which could lead to errors in the analysis. Hence, it is crucial to clean the data before proceeding with any further steps. We can check the statistics of the two new columns as follows:

In [None]:
# Examine the newly added columns
summary(GSE5281Samples[, c("condition", "region")])
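
One caveat of labeling with `ifelse(grepl(...))` is that every entry that does not match the pattern, including missing values or differently capitalized text, ends up in the "d" group. The sketch below illustrates this on a made-up vector and shows a stricter variant; it is only one possible way to guard against mislabeling, not the approach used for GSE5281 above.

```r
# Made-up annotations with inconsistent capitalization and a missing value
disease_state <- c("Disease State: normal",
                   "Disease State: Alzheimer's Disease",
                   "disease state: Normal",
                   NA)

# grepl() is case-sensitive by default and returns FALSE for NA,
# so the last two entries are labeled "d"
loose <- ifelse(grepl("normal", disease_state), "c", "d")

# Stricter variant: ignore case and keep missing annotations as NA
strict <- ifelse(is.na(disease_state),
                 NA_character_,
                 ifelse(grepl("normal", disease_state, ignore.case = TRUE), "c", "d"))

data.frame(disease_state, loose, strict)
```

Tabulating the labels with `table(strict, useNA = "ifany")` and comparing the counts with the study description is a quick sanity check.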

#### Gene Mapping

In this section, we create a gene ID conversion data frame that is used to convert the gene IDs in the dataset into common IDs such as gene symbols or Entrez IDs. This is useful for gene set or pathway analysis, which requires that the gene sets and the expression data use the same type of gene IDs (usually Entrez Gene ID or gene symbol). Depending on the platform used to generate the expression data, there is often an R package that can annotate the probes with Entrez Gene IDs or gene symbols. For example, the platform of the dataset GSE5281 is GPL570, which has an annotation package on Bioconductor named *hgu133plus2.db*. The list of available annotation packages can be found at: <a href="https://bioconductor.org/packages/3.18/data/annotation/"> https://bioconductor.org/packages/3.18/data/annotation/</a>.

To create the gene IDs mapping data frame, users can execute the following code snippets:

In [None]:
# Install the annotation database for the GPL570 (hgu133plus2) platform
suppressMessages({
    suppressWarnings({
        if (!require("BiocManager", quietly = TRUE)) {
            install.packages("BiocManager")
        }
        BiocManager::install("hgu133plus2.db", update = FALSE)
    })
})
# Load the hgu133plus2.db
library(hgu133plus2.db)

# Mapping for GSE5281
GSE5281Genes <- rownames(GSE5281Exprs)
GSE5281GenesMapping <- AnnotationDbi::select(x = hgu133plus2.db,
                                        keys = GSE5281Genes,
                                        columns = c("PROBEID", "SYMBOL"))
colnames(GSE5281GenesMapping) <- c("FROM", "SYMBOL")

To perform the gene ID mapping, we use the `select()` function from the AnnotationDbi package, which queries genome-wide annotation databases and converts gene IDs to the desired ID types. This function requires the following parameters: `x` – an *AnnotationDb* object such as *hgu133plus2.db*, `keys` – a vector containing the current IDs, and `columns` – a vector specifying which types of data (i.e., ID types) should be returned. In our example, we choose to return PROBEID and SYMBOL. We then rename the columns of the returned data frame to "FROM" and "SYMBOL", respectively. Users can examine this data frame as follows:

In [None]:
# Examine the mapping dataframe for GSE5281
head(GSE5281GenesMapping)

To map the PROBEID to the gene SYMBOL, we can use the `map_identifiers` function defined below.

In [None]:
#' @description Map the identifiers in a data frame (e.g., probe IDs) to target identifiers
#' (e.g., gene symbols) using a mapping data frame.
#'
#' @param data_df The data frame containing the data to be mapped.
#' @param mapping_df The data frame containing the mapping information.
#' @param data_source_col The name of the column, present in both data_df and mapping_df, that holds the identifiers to be mapped (default: "PROBEID").
#' @param data_target_col The name of the column in mapping_df that holds the target identifiers (default: "SYMBOL").
#' @param data_result_col Optional name of the output column that receives the mapped identifiers. If provided, the mapping columns are dropped from the output. If NULL, the mapping columns are kept.
#' @return A data frame whose row names are the mapped identifiers.
#'
map_identifiers <- function(data_df, mapping_df, data_source_col = "PROBEID", data_target_col = "SYMBOL", data_result_col = NULL) {

    # Merge data_df with mapping_df based on data_source_col
    data_df <- merge(mapping_df, data_df, by = data_source_col)
    # Remove rows with NA values in the data_target_col
    data_df <- data_df[!is.na(data_df[, data_target_col]), ]
    # Remove duplicated target identifiers, keeping the first occurrence
    data_df <- data_df[!duplicated(data_df[[data_target_col]], fromLast = FALSE), ]
    # Set row names to the values in data_target_col
    rownames(data_df) <- data_df[[data_target_col]]

    # Drop the columns that came from mapping_df when a result column is requested
    if (!is.null(data_result_col)) {
        data_df[[data_result_col]] <- data_df[[data_target_col]]

        if (data_result_col == data_target_col) {
            # Same name: drop the first two columns of mapping_df (source and target here)
            data_df <- data_df[, !(names(data_df) %in% colnames(mapping_df)[1:2]), drop = FALSE]
        } else {
            # Different name: drop all mapping_df columns and keep data_result_col
            data_df <- data_df[, !(names(data_df) %in% colnames(mapping_df)), drop = FALSE]
        }
    }

    return(data_df)
}

In [None]:
# Show the original probe id 
head(GSE5281Exprs)

In [None]:
# Convert the gene expression to data.frame
GSE5281Exprs <- as.data.frame(GSE5281Exprs)
# Create a column to contain the gene id (name should match the column in the mapping table)
GSE5281Exprs$FROM <- rownames(GSE5281Exprs)
# Use the map_identifiers to map the current gene id to the target gene id
GSE5281Exprs <- map_identifiers(data_df = GSE5281Exprs, mapping_df = GSE5281GenesMapping, 
                         data_source_col = "FROM", data_target_col = "SYMBOL", data_result_col = "SYMBOL")

In [None]:
# Show the gene expression data after mapping
head(GSE5281Exprs)
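
The effect of the mapping on unmapped and duplicated probes is easier to see on a tiny example. The sketch below repeats the core steps of `map_identifiers` (merge, drop unmapped rows, keep the first row per symbol) on made-up data; the probe names, symbols, and values are all hypothetical.

```r
# Toy expression table: three probes, two samples (made-up values)
toy_exprs <- data.frame(
    FROM = c("probe_1", "probe_2", "probe_3"),
    S1 = c(1.0, 2.0, 3.0),
    S2 = c(1.5, 2.5, 3.5),
    stringsAsFactors = FALSE
)

# Toy mapping: probe_1 and probe_2 map to the same symbol, probe_3 has no symbol
toy_mapping <- data.frame(
    FROM = c("probe_1", "probe_2", "probe_3"),
    SYMBOL = c("GENE_A", "GENE_A", NA),
    stringsAsFactors = FALSE
)

# Merge, drop unmapped rows, keep the first row per symbol,
# and use the symbols as row names
merged <- merge(toy_mapping, toy_exprs, by = "FROM")
merged <- merged[!is.na(merged$SYMBOL), ]
merged <- merged[!duplicated(merged$SYMBOL), ]
rownames(merged) <- merged$SYMBOL
toy_mapped <- merged[, c("S1", "S2"), drop = FALSE]
toy_mapped
```

Only one row survives: `probe_3` is dropped because it has no symbol, and `probe_2` is dropped as a duplicate of `GENE_A`. Keeping the first occurrence is a simple choice; other strategies, such as averaging the probes of a gene, are also used in practice.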

<!-- headings -->
<a id="dp-rnaseq"></a>
### 3.2. Example of RNA-Seq dataset: GSE153873

**Notes:** Most of the packages for analyzing RNA-Seq data embed a normalization process in their functions. Therefore, we will not perform data normalization for the GSE153873 dataset we have just downloaded.

#### Sample condition extraction

In [None]:
# Add a column specifying the condition of the sample,
# which can be either normal - c or alzheimer - d
RNASeqSampleConditions <- ifelse(grepl("Alzheimer", GSE153873Samples$characteristics_ch1.1), "d", "c")

# Factorize the newly added column
GSE153873Samples$condition <- factor(RNASeqSampleConditions)

# Reorder the samples to match the samples order in the expression data
GSE153873Samples <- GSE153873Samples[order(match(GSE153873Samples$title, colnames(GSE153873Exprs))), ]

# Examine the newly added column
summary(GSE153873Samples[, c("condition")])

For this dataset, the `characteristics_ch1.1` column defines the two conditions of the samples: normal and Alzheimer’s. Therefore, we added the `condition` column to `GSE153873Samples` to represent the groups of patients.

#### Gene Mapping

To create the gene IDs mapping for this dataset, we can run the following command lines:

In [None]:
# Install the genome wide annotation database for human
suppressMessages({
    suppressWarnings({
        if (!require("BiocManager", quietly = TRUE)) {
            install.packages("BiocManager")
        }
        BiocManager::install("org.Hs.eg.db", update = FALSE)
    })
})
# Import the annotation database
library(org.Hs.eg.db)

# Get current gene IDs used in RNA-Seq dataset
GSE153873Genes <- rownames(GSE153873Exprs)

# Create a mapping dataframe
GSE153873GenesMapping <- AnnotationDbi::select(
    x = org.Hs.eg.db, 
    keys = GSE153873Genes, 
    keytype = "SYMBOL", 
    columns = c("SYMBOL"))

GSE153873GenesMapping <- data.frame(
    FROM = GSE153873GenesMapping[,"SYMBOL"],
    SYMBOL = GSE153873GenesMapping[,"SYMBOL"])

GSE153873GenesMapping[1:10,]

For this dataset, gene symbols are already used as the IDs. To create the mapping data frame, we load the `org.Hs.eg.db` package, which provides genome-wide annotation for human. Users can install this package from Bioconductor following the instructions in our code snippet. After installation, we load the database and pass it to the `select()` function, whose parameters are set as follows: `x` – the AnnotationDb object `org.Hs.eg.db`, `keys` – a vector containing the current IDs (`GSE153873Genes`), `keytype` – a character string indicating the type of the current gene IDs (`SYMBOL`), and `columns` – a vector specifying which types of data are returned as output.

<!-- headings -->
<a id="dt-cloud"></a>
## 4. Storing Data to Amazon S3 Storage
To run `aws s3` commands, users can either create a code cell within the Jupyter notebook by clicking the + button at the top, or click on File -> New -> Terminal to open a terminal.

To create buckets from R, we can use the `system()` function, which runs shell commands from R. Note that S3 bucket names must be globally unique. Please check the naming rules [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html).

In [None]:
# Set the bucket name; remember that it needs to be globally unique
# Replace your-unique-name with the name of your bucket
system("aws s3 mb s3://your-unique-name", intern = TRUE)

The command to copy data from the local disk to the S3 bucket is as follows:

`aws s3 cp PATH_TO_LOCAL_FILE s3://PATH_IN_S3_BUCKET`

For example, to copy the file `./data/GSE5281.rds` to the bucket, we first save the processed datasets to the local disk and then run the copy command.

In [None]:
# Putting expression data, sample groups, and gene IDs mapping into a list
GSE5281Dataset <- list(expression_data = GSE5281Exprs, 
                       samples = GSE5281Samples, 
                       genes = GSE5281GenesMapping)

GSE153873Dataset <- list(expression_data = GSE153873Exprs, 
                       samples = GSE153873Samples, 
                       genes = GSE153873GenesMapping)

# Save the data to the local disk using rds format
saveRDS(GSE5281Dataset, file = "./data/GSE5281.rds")
saveRDS(GSE153873Dataset, file = "./data/GSE153873.rds")

In [None]:
# Replace your-unique-name with the name of the bucket that was previously created
system("aws s3 cp ./data/GSE5281.rds s3://your-unique-name", intern = TRUE)

We can download files from the S3 bucket using the following command. Here we download the data to the current directory of this notebook.

In [None]:
system("aws s3 cp s3://your-unique-name/GSE5281.rds ./")

Now we can try to load the data file that we have just downloaded from S3 bucket.

In [None]:
# Load the data downloaded from S3 bucket in current directory
data <- readRDS("./GSE5281.rds")
# Print out the keys of the data list
names(data)

We will also upload the data for the `GSE153873` dataset to the bucket.

In [None]:
# Replace your-unique-name with the name of the bucket that was previously created
system("aws s3 cp ./data/GSE153873.rds s3://your-unique-name", intern = TRUE)
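
When the same kind of command is issued for several files, it can be convenient to build the command string in R first and inspect it before running it. The helper below is a hypothetical sketch (`s3_cp_command` is not part of any package), and the bucket name is the same placeholder used above.

```r
# Hypothetical helper: build (but do not run) an `aws s3 cp` command string
s3_cp_command <- function(source, destination) {
    # shQuote() protects paths that contain spaces or shell metacharacters
    paste("aws s3 cp", shQuote(source), shQuote(destination))
}

upload_cmd <- s3_cp_command("./data/GSE5281.rds", "s3://your-unique-name/")
upload_cmd

# To actually run the command, pass the string to system() and check the exit status:
# status <- system(upload_cmd)
# if (status != 0) stop("aws s3 cp failed with exit status ", status)
```

Checking the exit status returned by `system()` makes a failed upload or download visible right away, instead of surfacing later as a missing file.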

<!-- headings -->
<a id="dt-export"></a>
## 5. Exporting Data

Once the expression data have been processed, we can export them to the `.csv` file format for inspection in other software such as Excel, using the base R `write.csv` function. In the code below, we save the processed expression matrix, the sample information, and the gene ID mapping to `.csv` files.

In [None]:
# Convert the processed expression matrix to a data frame
expression_data <- as.data.frame(GSE5281Exprs)

# Create the sub-directory ./data/export to save the expression matrix if it does not exist
export_dir <- file.path(getwd(), "data", "export")
if (!dir.exists(export_dir)) {
    dir.create(export_dir, recursive = TRUE)
}
# Save the expression values, sample information, and gene mapping as csv files in the local folder
write.csv(expression_data, file = "./data/export/GSE5281.csv")
write.csv(GSE5281Samples, file = "./data/samples_GSE5281.csv")
write.csv(GSE5281GenesMapping, file = "./data/genes_GSE5281.csv")

The `.csv` format is a simple text format that might not be suitable for storing big datasets. Alternatively, we can export the data in `.rds` format, which is more efficient for saving and loading R objects. We can put all the relevant data in a list and write it to disk using the built-in `saveRDS` function, as shown in the previous step.
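
The practical difference between the two formats can be checked with a quick round trip. The sketch below uses a made-up dataset and temporary files; the object and file names are hypothetical.

```r
# Toy dataset bundling an expression table and sample annotations (made-up values)
toy_dataset <- list(
    expression_data = data.frame(S1 = c(1.2, 3.4), S2 = c(5.6, 7.8),
                                 row.names = c("GENE_A", "GENE_B")),
    samples = data.frame(condition = factor(c("c", "d")),
                         row.names = c("S1", "S2"))
)

# RDS stores the whole R object, so the list structure and the factor survive
rds_file <- tempfile(fileext = ".rds")
saveRDS(toy_dataset, rds_file)
restored <- readRDS(rds_file)
identical(restored, toy_dataset)

# CSV stores a single table as text; column types are guessed again on reading
csv_file <- tempfile(fileext = ".csv")
write.csv(toy_dataset$samples, csv_file)
from_csv <- read.csv(csv_file, row.names = 1)
class(from_csv$condition)
```

The CSV round trip keeps the values but not necessarily the column classes, so factors may need to be rebuilt after reading a CSV file.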

## Conclusion

This notebook provided a guide to accessing, processing, and managing gene expression data for downstream analysis. We explored methods for downloading data from the NCBI GEO database using both the web interface and the `GEOquery` R package. We demonstrated how to upload data, both directly to the notebook instance and to a persistent Amazon S3 bucket for later retrieval. The notebook then detailed the data processing steps, including normalization, sample condition extraction, and gene ID mapping, using examples for both microarray (GSE5281) and RNA-Seq (GSE153873) datasets. Finally, we covered exporting processed data in both CSV and RDS formats, which keeps the data compatible with various analysis tools and efficient to store. These steps provide a foundation for subsequent analyses, such as differential expression and pathway enrichment analysis, which will be explored in further modules.

## Clean Up

Remember to move to the next notebook or shut down your instance if you are finished.

---

In [None]:
sessionInfo()