# Gene Expression Data Quantile Normalization
### Laurence Nickel (i6257119)

Libraries used: 
* limma (version: '3.54.2')
* preprocessCore (version: '1.60.2')
* data.table (version: '1.14.8')
* plotly (version: '4.10.2')

References:
* [1] Zhao, Y., Wong, L., & Goh, W. W. B. (2020). How to do quantile normalization correctly for gene expression data analyses. *Scientific Reports, 10*(1): 15534. doi: https://doi.org/10.1038/s41598-020-72664-6.

## Introduction

Within this notebook, we will quantile normalize the gene expression data after it has been split into training and test sets for four different splits in the notebook 'Training and Test Set Division.ipynb'. At this point we have the following two gene expression data files (located in the directory 'data_directory_final_datasets', which is defined later in this notebook):
* 'gene_expression_data_final.csv' file
* 'gene_expression_data_log2_transformed_final.csv' file

Quantile normalization will be performed for both of these files. Note that the methylation data were already normalized during their processing in the notebook 'Further Processing Methylation Files Part 1.ipynb'. This was possible because the techniques used there, Noob and BMIQ normalization (explained in more detail in that notebook), are both within-sample normalization methods, so no information can leak between the training and test sets through normalization.

Quantile normalization, in contrast, is not a within-sample normalization method: it aligns the distributions of the gene expression values of different samples by matching their quantiles [1]. If we were to perform it before splitting the data into training and test sets, information from the test set could influence the normalization process, compromising the independence of the test set [1]. This information leakage would result in an overestimation of the model's performance when evaluated on the test set. It is therefore crucial to normalize the training set and the test set separately. To achieve this, the samples in the training set are normalized first and a reference sample is retrieved from the result. The samples in the test set are then normalized using this reference sample.

The normalized gene expression data files will be stored separately, so that we can still experiment with both the normalized and the non-normalized versions within the notebook 'Linear Regression for Testing the Datasets.ipynb' in the 'Machine Learning Algorithms' folder.
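To make the idea of matching quantiles concrete, the following is a minimal base R sketch on a small toy matrix. The values and the helper name 'quantile_normalize' are hypothetical and only serve as an illustration; the sketch does not handle tied or missing values, which the 'normalizeQuantiles()' function of 'limma' used later in this notebook does.

```r
# Toy matrix (hypothetical values): rows are genes, columns are samples.
expression <- matrix(
  c(5, 2, 3, 4,
    4, 1, 6, 2,
    3, 4, 7, 8),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Hypothetical helper: plain quantile normalization without tie handling.
quantile_normalize <- function(x) {
  # Reference distribution: the mean of the k-th smallest value across all samples.
  reference <- unname(rowMeans(apply(x, 2, sort)))
  # Every value is replaced by the reference value at its within-sample rank.
  normalized <- apply(x, 2, function(column) reference[rank(column, ties.method = "first")])
  dimnames(normalized) <- dimnames(x)
  normalized
}

normalized <- quantile_normalize(expression)
print(normalized)
```

After this step every column contains the same set of values, only assigned to different genes, which is what makes the samples comparable.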

In addition to the two gene expression data files mentioned above, the files 'fold_assignments_samples.csv' and 'training_and_test_assignments_samples.csv' will be loaded into this notebook to determine which sample belongs to which set in each of the splits.

### Importing libraries

Before we start the quantile normalization of the gene expression data files, we first install the libraries that will be used throughout this notebook. The libraries 'limma' and 'preprocessCore' are installed through 'BiocManager', while 'data.table' and 'plotly' are installed through the R command 'install.packages()'.

In [1]:
# Checking whether the package 'BiocManager' has already been installed and installing it if it has not been installed yet.
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")


cat("Starting the installing of the libraries...")


# Using 'BiocManager' to install the following libraries (which are also mentioned in the introduction of this notebook).
BiocManager::install('limma')
BiocManager::install('preprocessCore')

# Using the R command 'install.packages()' to install the remaining necessary libraries.
install.packages("data.table")
install.packages("plotly")


cat("Finishing the installing of the libraries.")

Bioconductor version '3.16' is out-of-date; the current release version '3.17'
  is available with R version '4.3'; see https://bioconductor.org/install



Starting the installing of the libraries...

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.r-project.org

Bioconductor version 3.16 (BiocManager 1.30.20), R 4.2.3 (2023-03-15 ucrt)

"package(s) not installed when version(s) same as or greater than current; use
  `force = TRUE` to re-install: 'limma'"
Installation paths not writeable, unable to update packages
  path: C:/Program Files/R/R-4.2.3/library
  packages:
    class, KernSmooth, lattice, MASS, Matrix, nnet, survival

Old packages: 'BiasedUrn', 'BiocManager', 'broom', 'bslib', 'cachem', 'curl',
  'DelayedArray', 'dplyr', 'evaluate', 'fs', 'gargle', 'googledrive',
  'googlesheets4', 'httpuv', 'httr', 'httr2', 'jsonlite', 'knitr', 'later',
  'locfit', 'matrixStats', 'pkgbuild', 'plotly', 'profvis', 'rlang',
  'rmarkdown', 'sass', 'sys', 'testthat', 'tzdb', 'usethis', 'vctrs',
  'viridisLite', 'vroom', 'waldo', 'xfun', 'xml2'

'getOption("repos"

package 'data.table' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\laure\AppData\Local\Temp\Rtmp6HxaiZ\downloaded_packages


Installing package into 'C:/Users/laure/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)



package 'plotly' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\laure\AppData\Local\Temp\Rtmp6HxaiZ\downloaded_packages
Finishing the installing of the libraries.
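As the output above shows, reinstalling libraries that are already present produces warnings and takes time. A possible alternative, sketched below, is to first determine which libraries are missing. The helper name 'missing_packages' is hypothetical, and the actual installation calls are left commented out so that the sketch has no side effects.

```r
# Hypothetical helper: returns the subset of 'packages' that is not installed.
missing_packages <- function(packages) {
  installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  packages[!installed]
}

# The libraries used in this notebook, grouped by their source.
to_install_bioconductor <- missing_packages(c("limma", "preprocessCore"))
to_install_cran <- missing_packages(c("data.table", "plotly"))

# Only the missing libraries would then be installed:
# if (length(to_install_bioconductor) > 0) BiocManager::install(to_install_bioconductor)
# if (length(to_install_cran) > 0) install.packages(to_install_cran)
```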

Now that all the libraries have been installed, we can load them into this notebook with the command 'library()'. To verify that the libraries have been loaded, we use the command 'packageVersion()', which displays the version of the installed package.

In [2]:
# Loading the following libraries (which are also mentioned in the introduction of this notebook) into this notebook. 
library(limma)
library(preprocessCore)
library(data.table)
library(plotly)


# Retrieving the version of the packages to verify they have been correctly loaded into this notebook.
cat("The library 'limma' has been loaded into the notebook with its version being:")
packageVersion("limma")

cat("The library 'preprocessCore' has been loaded into the notebook with its version being:")
packageVersion("preprocessCore")

cat("The library 'data.table' has been loaded into the notebook with its version being:")
packageVersion("data.table")

cat("The library 'plotly' has been loaded into the notebook with its version being:")
packageVersion("plotly")

Loading required package: ggplot2


Attaching package: 'plotly'


The following object is masked from 'package:ggplot2':

    last_plot


The following object is masked from 'package:stats':

    filter


The following object is masked from 'package:graphics':

    layout




The library 'limma' has been loaded into the notebook with its version being:

[1] '3.54.2'

The library 'preprocessCore' has been loaded into the notebook with its version being:

[1] '1.60.2'

The library 'data.table' has been loaded into the notebook with its version being:

[1] '1.14.8'

The library 'plotly' has been loaded into the notebook with its version being:

[1] '4.10.2'

### Defining the data directories

In addition, we need to define the data directories from which the files will be loaded and in which the resulting files will be stored. Note that these paths need to be changed to the corresponding directories on your own machine.

In [3]:
data_directory_final_datasets = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets"
data_directory_training_and_test_splits = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/training_and_test_splits"
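Since these paths are machine-specific, it can help to check that they exist before any file is loaded. The following is a minimal sketch; the helper name 'check_directories' is hypothetical, and the demonstration uses the temporary directory of the R session instead of the paths above.

```r
# Hypothetical helper: stops with an informative message if any directory is missing.
check_directories <- function(directories) {
  missing <- directories[!dir.exists(directories)]
  if (length(missing) > 0) {
    stop("Directory not found: ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Demonstration with a directory that always exists within an R session.
check_directories(tempdir())
```

In this notebook the call would be 'check_directories(c(data_directory_final_datasets, data_directory_training_and_test_splits))'.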

## Loading Gene Expression Data Files

Within this section, we load the gene expression data files from the directory 'data_directory_final_datasets' by calling the function 'fread()' with the path of the file as its argument, which loads the file into this notebook as a data table. The '::' operator is used to call 'fread()' explicitly from the 'data.table' package.
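The loading cells in this notebook all follow the same pattern: build the path, check that the file exists, and read it. This pattern could be captured in a small helper, as sketched below. The helper name 'load_table' and its 'reader' argument are hypothetical. The default reader is base R's 'read.csv()' so that the sketch runs without extra libraries; passing 'reader = data.table::fread' would match the cells of this notebook.

```r
# Hypothetical helper: builds the path, checks that the file exists and reads it.
load_table <- function(directory, filename, reader = utils::read.csv) {
  path <- file.path(directory, filename)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  reader(path)
}

# Small demonstration on a temporary file with one sample and one gene.
demo_directory <- tempdir()
writeLines(
  c("Samples,ENSG00000000419", "TCGA-06-0125-01A-01,97.4399"),
  file.path(demo_directory, "demo_expression.csv")
)
demo_table <- load_table(demo_directory, "demo_expression.csv")
print(demo_table)
```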

#### Loading the 'gene_expression_data_final.csv' file into this notebook

In [4]:
# The path of the gene expression data file 'gene_expression_data_final.csv' to be loaded into this notebook.
path <- file.path(data_directory_final_datasets, "gene_expression_data_final.csv")

# If the 'path' defined above does not point to a file, the execution of this code block is terminated and an error message
# is displayed.
if (!file.exists(path)) {
  stop("File not found: ", path)
}

# Loading the gene expression data as a data table into this notebook by calling the function 'fread()'.
gene_expression_data <- data.table::fread(path)

cat("The gene expression data:")
gene_expression_data

The gene expression data:

Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,⋯,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
TCGA-06-0125-01A-01,97.4399,6.5428,5.9849,7.0651,16.9301,63.7275,14.5107,24.2783,0.9795,⋯,9.1356,15.5325,5.4792,1.9232,7.2505,5.5178,0.7025,0.0000,12.7288,4.1649
TCGA-06-0125-02A-11,63.4521,5.4929,3.1369,16.4290,16.4273,66.9871,22.8885,18.1955,1.4115,⋯,6.0041,8.3656,4.2820,1.6080,2.8211,2.1185,0.6252,1.4481,15.1055,2.2084
TCGA-06-0152-02A-01,98.1366,6.3809,5.7963,21.9912,19.3073,73.1009,9.4840,24.5692,1.3706,⋯,9.1559,10.9328,4.6383,1.7929,6.1256,3.5844,2.2429,0.0000,12.0011,3.3086
TCGA-06-0171-02A-11,55.5088,5.0426,2.7663,65.6843,50.7959,65.8725,20.9938,22.3543,4.6538,⋯,3.5008,3.8793,2.8926,0.9067,2.7872,3.3664,0.7014,0.0000,7.6533,0.5195
TCGA-06-0190-01A-01,80.6678,4.8642,6.9529,25.5317,24.9163,101.1771,17.4425,28.6862,5.4792,⋯,4.4420,7.2659,3.2201,0.9027,4.5347,2.6690,0.1727,0.0000,6.0973,3.0503
TCGA-06-0190-02A-01,85.4610,4.0998,5.4879,11.9973,45.3018,96.1271,9.7493,30.9364,3.6715,⋯,4.8917,8.1890,2.2898,0.5617,2.9595,2.2014,0.4030,0.0000,5.2496,1.3760
TCGA-06-0210-01A-01,80.0162,4.8043,2.8904,26.2095,12.3576,57.4271,12.6304,19.3338,1.4185,⋯,6.7796,5.0551,3.4160,0.4623,4.7931,1.9550,0.1769,0.0000,7.5160,1.7184
TCGA-06-0210-02A-01,49.7091,4.8995,4.2395,17.0881,14.5879,50.5623,14.7003,22.7468,2.1311,⋯,8.0275,4.6493,2.6001,0.3240,3.8818,1.4535,0.8135,0.0000,9.0069,0.9442
TCGA-06-0211-01A-01,85.0854,5.4429,6.2343,20.3002,8.0931,74.1980,9.0816,26.4136,2.1530,⋯,7.9559,8.2357,5.7024,0.7652,5.1043,2.0124,0.3294,1.3351,11.1323,1.9876
TCGA-06-0211-02A-02,119.5198,5.3726,4.5333,16.2691,12.6697,60.3359,9.0157,23.5439,2.1772,⋯,5.8372,6.3923,3.6478,0.7159,4.4773,1.8310,1.1505,0.0000,7.4212,1.7900


#### Loading the 'gene_expression_data_log2_transformed_final.csv' file into this notebook

In [5]:
# The path of the gene expression data file 'gene_expression_data_log2_transformed_final.csv' to be loaded into this 
# notebook.
path <- file.path(data_directory_final_datasets, "gene_expression_data_log2_transformed_final.csv")

# If the 'path' defined above does not point to a file, the execution of this code block is terminated and an error message
# is displayed.
if (!file.exists(path)) {
  stop("File not found: ", path)
}

# Loading the gene expression data as a data table into this notebook by calling the function 'fread()'.
gene_expression_data_log2_transformed <- data.table::fread(path)

cat("The gene expression data log2-transformed:")
gene_expression_data_log2_transformed

The gene expression data log2-transformed:

Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,⋯,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
TCGA-06-0125-01A-01,6.621171,2.915100,2.804239,3.011692,4.164312,6.016307,3.955192,4.659828,0.9851361,⋯,3.341360,4.047233,2.695816,1.5475485,3.044482,2.7043851,0.7676548,0.000000,3.779134,2.3687404
TCGA-06-0125-02A-11,6.010155,2.698863,2.048550,4.123418,4.123277,6.087189,4.578244,4.262696,1.2699308,⋯,2.808200,3.227371,2.401084,1.3829439,1.933988,1.6408523,0.7006173,1.291662,4.009482,1.6818540
TCGA-06-0152-02A-01,6.631346,2.883797,2.764750,4.523010,4.343927,6.211419,3.390117,4.676335,1.2452523,⋯,3.344246,3.576861,2.495260,1.4817639,2.833011,2.1967329,1.6972845,0.000000,3.700562,2.1072192
TCGA-06-0171-02A-11,5.820404,2.595169,1.913148,6.059275,5.694766,6.063341,4.459025,4.545616,2.4992208,⋯,2.170181,2.286674,1.960734,0.9310779,1.921132,2.1264443,0.7667224,0.000000,3.113250,0.6035967
TCGA-06-0190-01A-01,6.351695,2.551934,2.991481,4.729645,4.695788,6.674928,4.204962,4.891721,2.6958157,⋯,2.444137,3.047172,2.077277,0.9280481,2.468505,1.8753869,0.2298340,0.000000,2.827270,2.0180288
TCGA-06-0190-02A-01,6.433978,2.350441,2.697752,3.700140,5.532996,6.601802,3.426171,4.997130,2.2238859,⋯,2.558684,3.199908,1.718000,0.6431173,1.985318,1.6787029,0.4885150,0.000000,2.643764,1.2485348
TCGA-06-0210-01A-01,6.340139,2.537122,1.959918,4.766039,3.739589,5.868566,3.768756,4.345808,1.2741125,⋯,2.959696,2.598151,2.142740,0.5482393,2.534336,1.5631581,0.2349917,0.000000,3.090176,1.4427578
TCGA-06-0210-02A-01,5.664173,2.560593,2.389429,4.176969,3.962355,5.688245,3.972720,4.569661,1.6466696,⋯,3.174327,2.498072,1.848037,0.4049031,2.287413,1.2948413,0.8587767,0.000000,3.322923,0.9591766
TCGA-06-0211-01A-01,6.427697,2.687710,2.854853,4.412795,3.184772,6.232622,3.333653,4.776820,1.6567252,⋯,3.162838,3.207221,2.744678,0.8198317,2.609826,1.5909134,0.4107753,1.223484,3.600781,1.5789870
TCGA-06-0211-02A-02,6.913126,2.671882,2.468140,4.110121,3.772910,5.938660,3.324191,4.617293,1.6677559,⋯,2.773406,2.886023,2.216548,0.7789655,2.453465,1.5013117,1.1046721,0.000000,3.074026,1.4802651


## Loading Training and Test Split Data

Within this section, we load the training and test split data files from the directory 'data_directory_training_and_test_splits' by calling the function 'fread()' with the path of the file as its argument, which loads the file into this notebook as a data table. The '::' operator is used to call 'fread()' explicitly from the 'data.table' package.

#### Loading the 'fold_assignments_samples.csv' file into this notebook

In [6]:
# The path of the fold assignments file 'fold_assignments_samples.csv' to be loaded into this notebook.
path <- file.path(data_directory_training_and_test_splits, "fold_assignments_samples.csv")

# If the 'path' defined above does not point to a file, the execution of this code block is terminated and an error message
# is displayed.
if (!file.exists(path)) {
  stop("File not found: ", path)
}

# Loading the fold assignments data as a data table into this notebook by calling the function 'fread()'.
fold_assignments <- data.table::fread(path)

cat("The fold assignments data for the samples:")
fold_assignments

The fold assignments data for the samples:

Samples,Fold
<chr>,<int>
TCGA-06-0125-01A-01,1
TCGA-06-0125-02A-11,1
TCGA-06-0152-02A-01,2
TCGA-06-0171-02A-11,1
TCGA-06-0190-01A-01,4
TCGA-06-0190-02A-01,3
TCGA-06-0210-01A-01,3
TCGA-06-0210-02A-01,3
TCGA-06-0211-01A-01,4
TCGA-06-0211-02A-02,2


#### Loading the 'training_and_test_assignments_samples.csv' file into this notebook

In [7]:
# The path of the training and test assignments file 'training_and_test_assignments_samples.csv' to be loaded into this 
# notebook.
path <- file.path(data_directory_training_and_test_splits, "training_and_test_assignments_samples.csv")

# If the 'path' defined above does not point to a file, the execution of this code block is terminated and an error message
# is displayed.
if (!file.exists(path)) {
  stop("File not found: ", path)
}

# Loading the training and test assignments data as a data table into this notebook by calling the function 'fread()'.
training_and_test_assignments <- data.table::fread(path)

cat("The training and test assignments data for the samples:")
training_and_test_assignments

The training and test assignments data for the samples:

Samples,Split 1,Split 2,Split 3,Split 4
<chr>,<chr>,<chr>,<chr>,<chr>
TCGA-06-0125-01A-01,TEST,TRAIN,TRAIN,TRAIN
TCGA-06-0125-02A-11,TEST,TRAIN,TRAIN,TRAIN
TCGA-06-0152-02A-01,TRAIN,TEST,TRAIN,TRAIN
TCGA-06-0171-02A-11,TEST,TRAIN,TRAIN,TRAIN
TCGA-06-0190-01A-01,TRAIN,TRAIN,TRAIN,TEST
TCGA-06-0190-02A-01,TRAIN,TRAIN,TEST,TRAIN
TCGA-06-0210-01A-01,TRAIN,TRAIN,TEST,TRAIN
TCGA-06-0210-02A-01,TRAIN,TRAIN,TEST,TRAIN
TCGA-06-0211-01A-01,TRAIN,TRAIN,TRAIN,TEST
TCGA-06-0211-02A-02,TRAIN,TEST,TRAIN,TRAIN
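
In the rows shown above, a sample is labelled 'TEST' in 'Split k' exactly when it belongs to fold k. The sketch below shows how this relation between the two files could be checked. It uses toy copies of the first four rows and two of the split columns; the helper name 'splits_match_folds' is hypothetical.

```r
# Toy copies of the first four rows shown above.
fold_toy <- data.frame(
  Samples = c("TCGA-06-0125-01A-01", "TCGA-06-0125-02A-11",
              "TCGA-06-0152-02A-01", "TCGA-06-0171-02A-11"),
  Fold = c(1L, 1L, 2L, 1L)
)
split_toy <- data.frame(
  Samples = fold_toy$Samples,
  `Split 1` = c("TEST", "TEST", "TRAIN", "TEST"),
  `Split 2` = c("TRAIN", "TRAIN", "TEST", "TRAIN"),
  check.names = FALSE
)

# Hypothetical helper: TRUE if, for every split column, the samples labelled
# 'TEST' are exactly the samples whose fold number equals the split number.
splits_match_folds <- function(folds, splits) {
  split_columns <- setdiff(names(splits), "Samples")
  # Aligning the fold numbers to the row order of the split table.
  fold_of_sample <- folds$Fold[match(splits$Samples, folds$Samples)]
  all(vapply(split_columns, function(column) {
    split_number <- as.integer(sub("Split ", "", column, fixed = TRUE))
    identical(splits[[column]] == "TEST", fold_of_sample == split_number)
  }, logical(1)))
}

splits_match_folds(fold_toy, split_toy)
```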


## Quantile Normalizing the Gene Expression Data

As mentioned within the 'Introduction', we quantile normalize the gene expression data after the data has been split into training and test sets. We do this at this point, and not before the data splitting, because quantile normalization is not a within-sample normalization method. If we were to perform quantile normalization before the training and test splits, information from the test set could influence the normalization process, compromising the independence of the test set [1]. This information leakage would result in an overestimation of the model's performance when evaluated on the test set. It is therefore crucial to normalize the training set and the test set separately, which is now possible because the training and test sets have been defined for each of the four splits. These splits will be used by every machine learning algorithm to ensure that the comparison of their results is consistent and fair.

To quantile normalize the samples for a single split, we first normalize the samples that are present in the training set. From this normalized training set, one sample is retrieved that serves as a reference sample. The samples in the test set are then normalized using this reference sample. To achieve this, we define a function called 'quantile_normalize_gene_expression_data(gene_expression_data, split_number)' below, which normalizes the gene expression data passed to it. This function is called from the subsections below, each of which is dedicated to one of the four splits.

In [8]:
# The function 'quantile_normalize_gene_expression_data(gene_expression_data, split_number)' defined below quantile 
# normalizes the passed on 'gene_expression_data' taking into account which samples belong to the training and test set in 
# the current 'split_number'.
quantile_normalize_gene_expression_data <- function(gene_expression_data, split_number) {
    
    # Creating the vectors 'training_set_samples' and 'test_set_samples', which will later contain respectively the samples 
    # belonging to the training set and the samples belonging to the test set for the current 'split_number'.
    training_set_samples <- c()
    test_set_samples <- c()
    
    # Looping over all the samples present within the 'fold_assignments' data table and checking to which set the samples 
    # belong to.
    for (i in 1:nrow(fold_assignments)) {
        # If the current sample is not present within the fold that is the test set for the 'split_number', the sample is 
        # added to the 'training_set_samples' list.
        if (as.integer(fold_assignments$Fold[i]) != as.integer(split_number)) {
            training_set_samples <- c(training_set_samples, fold_assignments$Samples[i])
        }
        # Else if the current sample is present within the fold that is the test set for the 'split_number', the sample is 
        # added to the 'test_set_samples' list.
        else {
            test_set_samples <- c(test_set_samples, fold_assignments$Samples[i])
        }
    }
    
    # Retrieving the subsets of the passed on 'gene_expression_data' data table which only contain respectively the samples 
    # belonging to the training set and the samples belonging to the test set.
    gene_expression_data_training_set <- gene_expression_data[gene_expression_data$Samples %in% training_set_samples, ]
    gene_expression_data_test_set <- gene_expression_data[gene_expression_data$Samples %in% test_set_samples, ]
    
    # Retrieving only the gene expression values (and thus removing the sample names) of the 
    # 'gene_expression_data_training_set' data table and transforming it to a matrix by calling the function 'as.matrix()'.
    gene_expression_values_training <- as.matrix(gene_expression_data_training_set[, -1])
    
    # Since the function 'normalizeQuantiles()' assumes that the samples are represented by the columns of the matrix, we 
    # have to transpose the 'gene_expression_values_training' first by calling the function 't()'.
    gene_expression_values_training_T <- t(gene_expression_values_training)
    
    # Quantile normalizing the gene expression data table 'gene_expression_values_training_T' by calling the function 
    # 'normalizeQuantiles()' from the 'limma' library.
    normalized_matrix_gene_expression_data_training_set_T <- normalizeQuantiles(gene_expression_values_training_T)
    
    # Since the normalized matrix 'normalized_matrix_gene_expression_data_training_set_T' is the transposed version of the
    # original one, we can transpose it back by calling the function 't()'.
    normalized_matrix_gene_expression_data_training_set <- t(normalized_matrix_gene_expression_data_training_set_T)
    
    # Converting the resulting matrix 'normalized_matrix_gene_expression_data_training_set' to a data table by calling the 
    # function 'as.data.table()', adding the names of the genes as column headers by calling the function 'setnames()' and 
    # adding the 'Samples' column back to the normalized 'normalized_gene_expression_data_training_set' data table by 
    # calling the function 'cbind()'.
    normalized_gene_expression_data_training_set <- as.data.table(normalized_matrix_gene_expression_data_training_set)
    setnames(normalized_gene_expression_data_training_set, names(gene_expression_data_training_set[, -1]))
    normalized_data_training <- cbind(Samples = gene_expression_data_training_set$Samples, normalized_gene_expression_data_training_set)
    
    # Now that we have normalized the gene expression data of the samples belonging to the training set, we can proceed to 
    # normalize the gene expression data of the samples belonging to the test set. Within the steps below we first choose a 
    # random reference sample from the normalized training set. After quantile normalization all training samples contain 
    # (apart from the handling of tied values) the same values, only assigned to different genes, and therefore follow the 
    # same distribution. We can thus assign these values to every sample present in the test set. This means that for a 
    # sample present in the test set the highest value is assigned to the gene with the original highest value, the second 
    # highest value is assigned to the gene with the original second highest value, and so on.
    
    # Selecting a random reference sample from the 'normalized_data_training' data table by calling the 'sample()' function.
    # The row number is drawn from 1 up to the number of samples present in the training set.
    random_row_number <- sample(nrow(normalized_data_training), 1)
    reference_sample <- normalized_data_training[random_row_number,]
    
    # Extracting the normalization factors from the 'reference_sample' by calling the function 'as.vector()'.
    normalization_factors <- as.vector(as.matrix(reference_sample[, -1]))
    
    # Sorting the 'normalization_factors' such that the lowest value appears as the first element and the highest value 
    # appears as the last element.
    normalization_factors_sorted <- as.matrix(sort(normalization_factors))
    
    # Looping over all the samples present within the 'gene_expression_data_test_set' data table and reassigning the values 
    # to their genes such that they follow the same distribution as the 'reference_sample'.
    for (i in 1:nrow(gene_expression_data_test_set)) {
        # Retrieving the current sample from the 'gene_expression_data_test_set' data table and transposing the sample such 
        # that the gene IDs make up the rownames.
        sample <- t(gene_expression_data_test_set[i,-1])
        # Sorting the sample by calling the function 'order()' such that the lowest value in the sample appears as the first 
        # element and the highest value appears as the last element.
        sorted_sample <- sample[order(sample[, 1]), , drop = FALSE]
        
        # Looping over all the values of the 'sorted_sample' and assigning the values present in the sorted matrix 
        # 'normalization_factors_sorted' to be the new values in the 'sorted_sample'.
        for (j in 1:nrow(sorted_sample)) {
            sorted_sample[j,] <- normalization_factors_sorted[j,]
        } 
        
        # Looping over all the genes present in the 'gene_expression_data_test_set' data table and making sure by assigning 
        # the new values in the 'sorted_sample' to the corresponding gene that the gene with the original highest value is 
        # also assigned the highest value again, the original second highest value is assigned the second highest value 
        # again, and so on.
        for (gene in colnames(gene_expression_data_test_set[,-1])) {
            gene_expression_data_test_set[i, gene] <- sorted_sample[which(rownames(sorted_sample) == gene)]
        }
    }
    
    # Appending the two resulting 'normalized_data_training' and 'gene_expression_data_test_set' data tables such that all
    # 64 samples are again present in a single data table which can be achieved by calling the function 'rbind()'.
    combined_normalized_training_and_test_data <- rbind(normalized_data_training, gene_expression_data_test_set)
    
    # Since the appending in the previous step just vertically pastes them together, we should make sure that all the 
    # samples appear in the same order (alphabetically) as in the original 'gene_expression_data' data table which can be 
    # achieved by calling the function 'order()'.
    combined_normalized_training_and_test_data_sorted <- combined_normalized_training_and_test_data[order(combined_normalized_training_and_test_data$Samples),]
    
    return(combined_normalized_training_and_test_data_sorted)
}
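
The reference-based step for the test set can be illustrated on toy data. The sketch below is a vectorized base R variant of the two inner loops of the function above, under the assumption that there are no tied values. All names and values are hypothetical. The reference distribution is computed directly as the mean of the sorted training samples, which is what every normalized training sample contains when there are no ties.

```r
# Toy data (hypothetical values): rows are genes, columns are samples.
training <- matrix(
  c(5, 2, 3, 4,
    4, 1, 6, 2,
    3, 4, 7, 8),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("train", 1:3))
)
test_sample <- c(gene1 = 10, gene2 = 40, gene3 = 20, gene4 = 30)

# Sorted reference distribution derived from the training samples only.
reference_sorted <- unname(rowMeans(apply(training, 2, sort)))

# Hypothetical helper: gives each gene the reference value at its within-sample rank.
map_to_reference <- function(sample_values, reference_sorted) {
  stopifnot(length(sample_values) == length(reference_sorted))
  mapped <- reference_sorted[rank(sample_values, ties.method = "first")]
  names(mapped) <- names(sample_values)
  mapped
}

normalized_test_sample <- map_to_reference(test_sample, reference_sorted)
print(normalized_test_sample)
```

The test sample keeps its own gene ordering but takes all of its values from the training data, so no information from the test set enters the reference distribution.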

#### Quantile Normalizing the Gene Expression Data for Split 1

We can now quantile normalize the gene expression data for the first split by calling the function 'quantile_normalize_gene_expression_data(gene_expression_data, split_number)' defined above. This is done for both gene expression data tables.

In [None]:
# Quantile normalizing the 'gene_expression_data' data table for split 1.
normalized_gene_expression_data_split1 <- quantile_normalize_gene_expression_data(gene_expression_data, 1)

# Quantile normalizing the 'gene_expression_data_log2_transformed' data table for split 1.
normalized_gene_expression_data_log2_transformed_split1 <- quantile_normalize_gene_expression_data(gene_expression_data_log2_transformed, 1)

Next, we can store these resulting normalized gene expression data tables by calling the function 'fwrite()' from the 'data.table' library which takes as arguments the data table to be stored and the location where the data table should be stored.

In [None]:
# The path where the 'normalized_gene_expression_data_split1' data table should be stored.
path <- file.path(data_directory_final_datasets, "gene_expression_data_normalized_split1.csv")

# If the 'path' defined above already points to a file, the file is not overwritten and a message is displayed informing 
# that the file already exists.
if (file.exists(path)) {
    cat(paste("There is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(normalized_gene_expression_data_split1, path)
    cat(paste("The file has been created at the path: ", path))
}


# The path where the 'normalized_gene_expression_data_log2_transformed_split1' data table should be stored.
path <- file.path(data_directory_final_datasets, "gene_expression_data_log2_transformed_normalized_split1.csv")

# If the 'path' defined above already points to a file, the file is not overwritten and a message is displayed informing 
# that the file already exists.
if (file.exists(path)) {
    cat(paste("\nThere is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(normalized_gene_expression_data_log2_transformed_split1, path)
    cat(paste("\nThe file has been created at the path: ", path))
}
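
The storing cells for the splits all repeat the same check before writing. A minimal sketch of how this check could be wrapped in a helper is given below. The helper name 'write_if_absent' and its 'writer' argument are hypothetical. The default writer is base R's 'write.csv()', which also writes row names, so that the sketch runs without extra libraries; passing 'writer = data.table::fwrite' would match the cells of this notebook.

```r
# Hypothetical helper: writes 'data' to 'path' only if no file exists there yet.
write_if_absent <- function(data, path, writer = utils::write.csv) {
  if (file.exists(path)) {
    cat("There is already a file present at the path:", path, "\n")
    return(invisible(FALSE))
  }
  writer(data, path)
  cat("The file has been created at the path:", path, "\n")
  invisible(TRUE)
}

# Demonstration on a temporary file: the second call does not overwrite the file.
demo_path <- tempfile(fileext = ".csv")
demo_data <- data.frame(Samples = "TCGA-06-0125-01A-01")
first_attempt <- write_if_absent(demo_data, demo_path)
second_attempt <- write_if_absent(demo_data, demo_path)
```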

#### Quantile Normalizing the Gene Expression Data for Split 2

We can now quantile normalize the gene expression data for the second split by calling the function 'quantile_normalize_gene_expression_data(gene_expression_data, split_number)' defined above. This is done for both gene expression data tables.

In [11]:
# Quantile normalizing the 'gene_expression_data' data table for split 2.
normalized_gene_expression_data_split2 <- quantile_normalize_gene_expression_data(gene_expression_data, 2)

# Quantile normalizing the 'gene_expression_data_log2_transformed' data table for split 2.
normalized_gene_expression_data_log2_transformed_split2 <- quantile_normalize_gene_expression_data(gene_expression_data_log2_transformed, 2)

Next, we can store these resulting normalized gene expression data tables by calling the function 'fwrite()' from the 'data.table' library which takes as arguments the data table to be stored and the location where the data table should be stored.

In [12]:
# The path where the 'normalized_gene_expression_data_split2' data table should be stored.
path <- file.path(data_directory_final_datasets, "gene_expression_data_normalized_split2.csv")

# If the 'path' defined above already points to a file, the file is not overwritten and a message is displayed informing 
# that the file already exists.
if (file.exists(path)) {
    cat(paste("There is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(normalized_gene_expression_data_split2, path)
    cat(paste("The file has been created at the path: ", path))
}


# The path where the 'normalized_gene_expression_data_log2_transformed_split2' data table should be stored.
path <- file.path(data_directory_final_datasets, "gene_expression_data_log2_transformed_normalized_split2.csv")

# If the 'path' defined above already points to a file, the file is not overwritten and a message is displayed informing 
# that the file already exists.
if (file.exists(path)) {
    cat(paste("\nThere is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(normalized_gene_expression_data_log2_transformed_split2, path)
    cat(paste("\nThe file has been created at the path: ", path))
}

The file has been created at the path:  C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/Distance Analysis/gene_expression_data_genes_removed_normalized_split2.csv
The file has been created at the path:  C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/Distance Analysis/gene_expression_data_log2_transformed_genes_removed_normalized_split2.csv
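
Since the cells for the four splits differ only in the split number, the output file names could also be generated in a loop. The sketch below only builds and prints the file names used in the code cells of this notebook. The calls that need the loaded data are left as comments, and the variable names are hypothetical.

```r
# The four splits and the output file names used in the code cells of this notebook.
split_numbers <- 1:4
output_files <- data.frame(
  split = split_numbers,
  raw_file = sprintf("gene_expression_data_normalized_split%d.csv", split_numbers),
  log2_file = sprintf("gene_expression_data_log2_transformed_normalized_split%d.csv", split_numbers)
)

for (row in seq_len(nrow(output_files))) {
  split_number <- output_files$split[row]
  # With the data loaded, the normalization and storing would take place here:
  # normalized <- quantile_normalize_gene_expression_data(gene_expression_data, split_number)
  # data.table::fwrite(normalized, file.path(data_directory_final_datasets, output_files$raw_file[row]))
  cat("Split", split_number, "->", output_files$raw_file[row], "and", output_files$log2_file[row], "\n")
}
```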

#### Quantile Normalizing the Gene Expression Data for Split 3

We can now quantile normalize the gene expression data for the third split by calling the function 'quantile_normalize_gene_expression_data(gene_expression_data, split_number)' defined above. This is done for both gene expression data tables.

In [13]:
# Quantile normalizing the 'gene_expression_data' data table for split 3.
normalized_gene_expression_data_split3 <- quantile_normalize_gene_expression_data(gene_expression_data, 3)

# Quantile normalizing the 'gene_expression_data_log2_transformed' data table for split 3.
normalized_gene_expression_data_log2_transformed_split3 <- quantile_normalize_gene_expression_data(gene_expression_data_log2_transformed, 3)
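To make the idea behind these calls concrete, the following toy example sketches quantile normalization with a reference derived from the training samples only, as described in the introduction. It is a base R illustration under simplifying assumptions (no tied values, no missing values) and is not the implementation of 'quantile_normalize_gene_expression_data()'; the names 'map_to_reference', 'train' and 'test' are hypothetical.

```r
# Toy expression matrix: rows are genes, columns are samples.
set.seed(1)
expression <- matrix(rexp(5 * 4), nrow = 5,
                     dimnames = list(NULL, paste0("sample", 1:4)))

# The first three samples act as the training set, the last one as the test set.
train <- expression[, 1:3]
test <- expression[, 4, drop = FALSE]

# Reference distribution: for each rank, the mean of the sorted training values.
# Only the training samples contribute, so the test sample cannot leak into it.
reference <- rowMeans(apply(train, 2, sort))

# Replaces every value of a sample by the reference value of the same rank.
map_to_reference <- function(values, reference) {
    reference[rank(values, ties.method = "first")]
}

# The training and test samples are both mapped onto the training reference.
train_normalized <- apply(train, 2, map_to_reference, reference = reference)
test_normalized <- apply(test, 2, map_to_reference, reference = reference)
```

After this step, every training and test sample contains exactly the values of 'reference', only in the order given by the original ranking of its genes.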

Next, we store the two normalized gene expression data tables for the third split by calling the function 'fwrite()' from the 'data.table' library, which takes as arguments the data table to be stored and the path of the file to which it should be written.

In [14]:
# The path where the 'normalized_gene_expression_data_split3' data table should be stored.
path <- file.path(data_directory_final_datasets, "gene_expression_data_normalized_split3.csv")

# If a file already exists at 'path', it is not overwritten and a message is displayed
# informing that the file already exists.
if (file.exists(path)) {
    cat(paste("There is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(normalized_gene_expression_data_split3, path)
    cat(paste("The file has been created at the path: ", path))
}


# The path where the 'normalized_gene_expression_data_log2_transformed_split3' data table should be stored.
path <- file.path(data_directory_final_datasets, "gene_expression_data_log2_transformed_normalized_split3.csv")

# If a file already exists at 'path', it is not overwritten and a message is displayed
# informing that the file already exists.
if (file.exists(path)) {
    cat(paste("\nThere is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(normalized_gene_expression_data_log2_transformed_split3, path)
    cat(paste("\nThe file has been created at the path: ", path))
}

The file has been created at the path:  C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/Distance Analysis/gene_expression_data_genes_removed_normalized_split3.csv
The file has been created at the path:  C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/Distance Analysis/gene_expression_data_log2_transformed_genes_removed_normalized_split3.csv

#### Quantile Normalizing the Gene Expression Data for Split 4

We can now quantile normalize the gene expression data for the fourth split by calling the function 'quantile_normalize_gene_expression_data(gene_expression_data, split_number)' defined above. This is done for both gene expression data tables.

In [15]:
# Quantile normalizing the 'gene_expression_data' data table for split 4.
normalized_gene_expression_data_split4 <- quantile_normalize_gene_expression_data(gene_expression_data, 4)

# Quantile normalizing the 'gene_expression_data_log2_transformed' data table for split 4.
normalized_gene_expression_data_log2_transformed_split4 <- quantile_normalize_gene_expression_data(gene_expression_data_log2_transformed, 4)

Next, we store the two normalized gene expression data tables for the fourth split by calling the function 'fwrite()' from the 'data.table' library, which takes as arguments the data table to be stored and the path of the file to which it should be written.

In [16]:
# The path where the 'normalized_gene_expression_data_split4' data table should be stored.
path <- file.path(data_directory_final_datasets, "gene_expression_data_normalized_split4.csv")

# If a file already exists at 'path', it is not overwritten and a message is displayed
# informing that the file already exists.
if (file.exists(path)) {
    cat(paste("There is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(normalized_gene_expression_data_split4, path)
    cat(paste("The file has been created at the path: ", path))
}


# The path where the 'normalized_gene_expression_data_log2_transformed_split4' data table should be stored.
path <- file.path(data_directory_final_datasets, "gene_expression_data_log2_transformed_normalized_split4.csv")

# If a file already exists at 'path', it is not overwritten and a message is displayed
# informing that the file already exists.
if (file.exists(path)) {
    cat(paste("\nThere is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(normalized_gene_expression_data_log2_transformed_split4, path)
    cat(paste("\nThe file has been created at the path: ", path))
}

The file has been created at the path:  C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/Distance Analysis/gene_expression_data_genes_removed_normalized_split4.csv
The file has been created at the path:  C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/Distance Analysis/gene_expression_data_log2_transformed_genes_removed_normalized_split4.csv
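Once all splits have been processed, a quick check could confirm that every expected file is present. The sketch below builds the file names from the pattern used in the code cells above; it assumes that the files for split 1 follow the same pattern. The helper 'expected_normalized_files' is hypothetical, and 'tempdir()' only stands in for 'data_directory_final_datasets' so that the example runs on its own.

```r
# Hypothetical helper: builds the paths of the expected normalized files
# (two gene expression data tables for each of the given splits).
expected_normalized_files <- function(directory, splits = 1:4) {
    prefixes <- c("gene_expression_data", "gene_expression_data_log2_transformed")
    # Combining every prefix with every split number into a file name.
    file_names <- as.vector(outer(prefixes, splits, function(prefix, split) {
        paste0(prefix, "_normalized_split", split, ".csv")
    }))
    file.path(directory, file_names)
}

# In the notebook, 'data_directory_final_datasets' would be passed instead of 'tempdir()'.
paths <- expected_normalized_files(tempdir())

# Reporting which of the expected files are not present.
missing_files <- paths[!file.exists(paths)]
if (length(missing_files) > 0) {
    cat("The following files are missing:", missing_files, sep = "\n")
} else {
    cat("All expected normalized files are present.\n")
}
```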