# Linear Regression for Testing the Datasets
### Laurence Nickel (i6257119)

Libraries used:
* pandas (version: '1.2.4')
* re (version: '2.2.1')
* sys (Python version: '3.8.8')
* os (Python version: '3.8.8')
* plotly.express (version: '5.13.1')
* matplotlib.pyplot (version: '3.3.4')
* seaborn (version: '0.11.1')
* sklearn.linear_model (version: '0.24.1')
* sklearn.metrics (version: '0.24.1')
* joblib (version: '1.0.1')

References:
* [1] Lederer, J. (2022). "Linear Regression," in *Fundamentals of High-Dimensional Statistics* (Berlin/Heidelberg, Germany: Springer), 37-79.
* [2] Mallik, S., Seth, S., Bhadra, T., & Zhao, Z. (2020). A Linear Regression and Deep Learning Approach for Detecting Reliable Genetic Alterations in Cancer Using DNA Methylation and Gene Expression Data. *Genes, 11*(8): 931. doi: https://doi.org/10.3390/genes11080931.
* [3] Miles, J. (2005). "R-squared, Adjusted R-squared," in *Encyclopedia of Statistics in Behavioral Science - Volume 4*, eds B. S. Everitt & D. C. Howell (Hoboken, NJ, USA: John Wiley & Sons), 1655-1657. doi: https://doi.org/10.1002/0470013192.bsa526.
* [4] scikit-learn (2023). sklearn.metrics.r2_score - scikit-learn 1.2.2. Available: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html (last accessed June 7, 2023).

## Introduction

In this notebook, linear regression is used to predict the expression level of each gene from the methylation values of the CpG sites, with the goal of finding the combination of datasets that performs best. The methylation and gene expression datasets originate from the folder 'Bachelor Thesis Data/final_datasets'.

Linear regression is a statistical modeling technique that relates a dependent variable to one or more independent variables [1]. It does so by finding the line of best fit, i.e. the linear equation that minimizes the sum of squared residuals. As the name states, linear regression assumes a linear relationship between the independent variables and the dependent variable, but by including polynomial terms of the independent variables it can also model non-linear relationships. Linear regression has previously been applied successfully to gene expression and methylation data [2], which indicates that it is suitable for predicting gene expression values from methylation data. Please note that this does not mean that linear regression is necessarily the best-performing regression method for this task, but it provides reasonable results for our purpose of finding the combination of datasets that performs best.
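As a minimal sketch of the two ideas above (a least-squares fit, and polynomial terms for a non-linear relationship), the following uses small synthetic arrays. The data and all variable names are made up for illustration and are not part of the thesis data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): 20 samples with 2 independent variables.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 2))
# The dependent variable is an exact linear function of X (no noise).
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

# Least-squares fit: recovers the slopes and the intercept.
linear_model = LinearRegression().fit(X, y)
print(linear_model.coef_, linear_model.intercept_)

# A quadratic relationship can still be fitted by a linear model
# once the squared term is added as an extra independent variable.
x = np.linspace(-1.0, 1.0, 15)
y_quadratic = 2.0 * x**2 + 0.5
X_polynomial = np.column_stack([x, x**2])
polynomial_model = LinearRegression().fit(X_polynomial, y_quadratic)
print(polynomial_model.coef_, polynomial_model.intercept_)
```

The model stays linear in its coefficients in both cases; only the set of input columns changes.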

At this point, we have collected two kinds of methylation data files and four kinds of gene expression data files, which are listed below:
* Regular methylation data file
* M-transformed methylation data file
* Regular gene expression data file
* Log2-transformed gene expression data file
* Normalized regular gene expression data files (this consists of multiple files as we normalized the gene expression data for each split)
* Normalized log2-transformed gene expression data files (this consists of multiple files as we normalized the gene expression data for each split)

For each combination of these kinds (one kind of methylation data file paired with one kind of gene expression data file), linear regression models are built, one for each gene, and evaluated. This results in a total of 8 (2 x 4) evaluation measurements, one for each combination of the kinds of data. The best-performing combination will be chosen and used in the notebooks that apply different machine learning algorithms for both the Distance Analysis part and the CpG Site Analysis part.
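The eight combinations can be enumerated as sketched below. The label strings are hypothetical shorthand for the kinds of data files listed above and are not identifiers from the thesis code.

```python
from itertools import product

# Hypothetical labels for the two kinds of methylation data files.
methylation_kinds = ["regular", "M-transformed"]

# Hypothetical labels for the four kinds of gene expression data files.
gene_expression_kinds = [
    "regular",
    "log2-transformed",
    "normalized regular",
    "normalized log2-transformed",
]

# Every pairing of one methylation kind with one gene expression kind.
combinations = list(product(methylation_kinds, gene_expression_kinds))

for methylation_kind, gene_expression_kind in combinations:
    print(f"methylation: {methylation_kind} | gene expression: {gene_expression_kind}")
```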

Since the purpose of this notebook is to compare the different combinations of data files and not to assess the results further, we reduce the computational burden by only building models for the genes located on chromosome 1 and, for each gene, selecting only the CpG sites within a distance of 5,000,000 base pairs in both directions from the gene. These exact settings are used for all 8 experiments so that the results are directly comparable.

To measure the prediction accuracy for the gene expression values, the R-squared (R<sup>2</sup>) metric is computed for each of the predictions; it indicates the proportion of the variance in the dependent variable that is explained by the model [3]. Higher R<sup>2</sup> values indicate that a larger proportion of the variance is explained, with 1 being the largest possible value. The R<sup>2</sup> value is obtained by applying 4-fold cross-validation and averaging the R<sup>2</sup> values of the four folds. The training and test splits are those defined in the notebook 'Training and Test Set Division.ipynb' in the 'Machine Learning Algorithms - Preprocessing' folder, which also motivates the choice of k = 4.
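The evaluation scheme described here can be sketched as follows: for each fold, a model is fitted on the remaining folds, R<sup>2</sup> is computed on the held-out fold, and the per-fold values are averaged. The function name `mean_r2_over_folds` and the synthetic data are assumptions made for illustration; this is not the thesis implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def mean_r2_over_folds(X, y, folds):
    """Fit on all folds but one, score R^2 on the held-out fold, then average."""
    scores = []
    for fold in np.unique(folds):
        test_mask = folds == fold
        model = LinearRegression().fit(X[~test_mask], y[~test_mask])
        predictions = model.predict(X[test_mask])
        scores.append(r2_score(y[test_mask], predictions))
    return float(np.mean(scores)), scores


# Synthetic stand-in data (hypothetical): 40 samples, 3 predictors,
# and a dependent variable that is an exact linear function of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0

# Fold labels 1..4, ten samples per fold.
folds = np.tile([1, 2, 3, 4], 10)

mean_r2, per_fold_r2 = mean_r2_over_folds(X, y, folds)
print(per_fold_r2, mean_r2)
```

On held-out data, `r2_score` can also return negative values, which happens when the model predicts worse than the mean of the test values.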

### Importing libraries

Before we can start to define all the functions, we should first import some libraries that will be used throughout this notebook.

In [1]:
print("Starting the importing of the libraries...")


import pandas as pd
import re
import sys
import os

# Here we first need to install the plotly library.
!pip install plotly
import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

!pip install joblib
import joblib
from joblib import Parallel, delayed



print("Finished importing the libraries.")

Starting the importing of the libraries...
Finished importing the libraries.


Now that all the libraries have been imported, we can verify that they have been loaded into this notebook by printing the version attribute of each library. Note that 'sys.version' reports the version of the Python interpreter itself.

In [2]:
# Retrieving the version of the libraries to verify they have been correctly loaded into this notebook.
print("The library 'pd' (pandas) has been loaded into the notebook with its version being:")
print(pd.__version__)

print("\nThe library 're' has been loaded into the notebook with its version being:")
print(re.__version__)

print("\nThe library 'sys' has been loaded into the notebook with its version being:")
print(sys.version)

print("\nThe library 'plotly' has been loaded into the notebook with its version being:")
print(plotly.__version__)

print("\nThe library 'sns' (seaborn) has been loaded into the notebook with its version being:")
print(sns.__version__)

print("\nThe library 'matplotlib' has been loaded into the notebook with its version being:")
print(matplotlib.__version__)

print("\nThe library 'sklearn' has been loaded into the notebook with its version being:")
print(sklearn.__version__)

print("\nThe library 'joblib' has been loaded into the notebook with its version being:")
print(joblib.__version__)

The library 'pd' (pandas) has been loaded into the notebook with its version being:
1.2.4

The library 're' has been loaded into the notebook with its version being:
2.2.1

The library 'sys' has been loaded into the notebook with its version being:
3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]

The library 'plotly' has been loaded into the notebook with its version being:
5.13.1

The library 'sns' (seaborn) has been loaded into the notebook with its version being:
0.11.1

The library 'matplotlib' has been loaded into the notebook with its version being:
3.3.4

The library 'sklearn' has been loaded into the notebook with its version being:
0.24.1

The library 'joblib' has been loaded into the notebook with its version being:
1.0.1


### Defining the data directories

In addition, we need to define the data directories from which the gene expression and methylation data files and the training and test split data files are loaded. Please note that these paths must be changed to the corresponding directories on your own machine before the notebook can be run.

In [23]:
data_directory_final_datasets = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets"
data_directory_training_and_test_splits = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/training_and_test_splits"
data_directory_results_dataset = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/results/Dataset Combination Testing"
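One way to make these paths easier to adapt is to derive them from a single base directory, as in the sketch below. The base path and the variable names ending in `_dir` are hypothetical and chosen so they do not clash with the string variables defined above.

```python
from pathlib import Path

# Hypothetical base directory: replace with the folder that contains the thesis data.
base_dir = Path("Bachelor Thesis Data")

# Sub-directories mirroring the three directories defined above.
final_datasets_dir = base_dir / "final_datasets"
training_and_test_splits_dir = base_dir / "training_and_test_splits"
results_dataset_dir = base_dir / "results" / "Dataset Combination Testing"

# A file path is then built with the same '/' operator.
fold_assignments_path = training_and_test_splits_dir / "fold_assignments_samples.csv"
print(fold_assignments_path.as_posix())
```

`pd.read_csv()` accepts such `Path` objects directly, so only `base_dir` would need to change between machines.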

## Loading Training and Test Split Data

In this section, we load the training and test split data files from the directory 'data_directory_training_and_test_splits' into this notebook by calling the function 'pd.read_csv()' with the path of the file to be read as its argument.

#### Loading the 'fold_assignments_samples.csv' file into this notebook

In [4]:
# Loading the file 'fold_assignments_samples.csv'.
fold_assignments = pd.read_csv(data_directory_training_and_test_splits + '/fold_assignments_samples.csv')

print("The 'fold_assignments' DataFrame:")
fold_assignments

The 'fold_assignments' DataFrame:


Unnamed: 0,Samples,Fold
0,TCGA-06-0125-01A-01,1
1,TCGA-06-0125-02A-11,1
2,TCGA-06-0152-02A-01,2
3,TCGA-06-0171-02A-11,1
4,TCGA-06-0190-01A-01,4
...,...,...
59,TCGA-76-4927-01A-01,1
60,TCGA-76-4928-01B-01,3
61,TCGA-76-4929-01A-01,2
62,TCGA-76-4931-01A-01,2


#### Loading the 'training_and_test_assignments_samples.csv' file into this notebook

In [5]:
# Loading the file 'training_and_test_assignments_samples.csv'.
training_and_test_assignments = pd.read_csv(data_directory_training_and_test_splits + '/training_and_test_assignments_samples.csv')

print("The 'training_and_test_assignments' DataFrame:")
training_and_test_assignments

The 'training_and_test_assignments' DataFrame:


Unnamed: 0,Samples,Split 1,Split 2,Split 3,Split 4
0,TCGA-06-0125-01A-01,TEST,TRAIN,TRAIN,TRAIN
1,TCGA-06-0125-02A-11,TEST,TRAIN,TRAIN,TRAIN
2,TCGA-06-0152-02A-01,TRAIN,TEST,TRAIN,TRAIN
3,TCGA-06-0171-02A-11,TEST,TRAIN,TRAIN,TRAIN
4,TCGA-06-0190-01A-01,TRAIN,TRAIN,TRAIN,TEST
...,...,...,...,...,...
59,TCGA-76-4927-01A-01,TEST,TRAIN,TRAIN,TRAIN
60,TCGA-76-4928-01B-01,TRAIN,TRAIN,TEST,TRAIN
61,TCGA-76-4929-01A-01,TRAIN,TEST,TRAIN,TRAIN
62,TCGA-76-4931-01A-01,TRAIN,TEST,TRAIN,TRAIN
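Comparing the two previews suggests that a sample assigned to fold k is labelled TEST in 'Split k' and TRAIN in every other split. Under that assumption, the sketch below rebuilds the split labels from the fold numbers and extracts the sample lists for one split. It uses four rows copied from the previews above; the function name `split_labels_from_folds` is hypothetical.

```python
import pandas as pd

# Four rows copied from the 'fold_assignments' preview above.
fold_assignments_example = pd.DataFrame({
    "Samples": [
        "TCGA-06-0125-01A-01",
        "TCGA-06-0152-02A-01",
        "TCGA-06-0190-01A-01",
        "TCGA-76-4928-01B-01",
    ],
    "Fold": [1, 2, 4, 3],
})


def split_labels_from_folds(folds, n_splits=4):
    """Label a sample TEST in 'Split k' if its fold equals k, TRAIN otherwise."""
    labels = pd.DataFrame({"Samples": folds["Samples"]})
    for k in range(1, n_splits + 1):
        is_test = folds["Fold"].eq(k)
        labels[f"Split {k}"] = is_test.map({True: "TEST", False: "TRAIN"})
    return labels


labels = split_labels_from_folds(fold_assignments_example)

# Sample identifiers belonging to the test and training sets of split 1.
test_samples_split_1 = labels.loc[labels["Split 1"] == "TEST", "Samples"].tolist()
train_samples_split_1 = labels.loc[labels["Split 1"] == "TRAIN", "Samples"].tolist()
print(test_samples_split_1)
print(train_samples_split_1)
```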


## Loading all the Different Datasets

In this section, all the different datasets from the directory 'data_directory_final_datasets' that will be experimented with are loaded into this notebook. Each file is loaded by calling the function 'pd.read_csv()' with the path of the file to be read as its argument.

### Loading the Methylation Data Files

Within this subsection, all the methylation data files from the directory 'data_directory_final_datasets' are loaded into this notebook.

#### Loading the 'methylation_data_final.csv' file into this notebook

In [7]:
# Loading the file 'methylation_data_final.csv'.
methylation_data = pd.read_csv(data_directory_final_datasets + '/methylation_data_final.csv')

print("The 'methylation_data' DataFrame:")
methylation_data

The 'methylation_data' DataFrame:


Unnamed: 0,Samples,cg00000957,cg00001349,cg00001583,cg00002837,cg00003287,cg00004121,cg00008647,cg00009292,cg00011717,...,ch.22.28920330F,ch.22.436090R,ch.22.441164F,ch.22.528917R,ch.22.569473R,ch.22.707049R,ch.22.728807R,ch.22.734399R,ch.22.772318F,ch.22.909671F
0,TCGA-06-0125-01A-01,0.897657,0.939643,0.022323,0.835868,0.609458,0.573148,0.077747,0.515607,0.962390,...,0.047403,0.247444,0.023274,0.048324,0.092222,0.030177,0.042784,0.259508,0.031644,0.041573
1,TCGA-06-0125-02A-11,0.901618,0.934091,0.021567,0.721376,0.643082,0.607448,0.092594,0.702981,0.981128,...,0.121478,0.232804,0.027033,0.044326,0.104435,0.031269,0.086301,0.227487,0.031728,0.072908
2,TCGA-06-0152-02A-01,0.943356,0.925107,0.688869,0.463914,0.510393,0.633310,0.342348,0.727519,0.981531,...,0.039210,0.176567,0.021798,0.031384,0.166394,0.038748,0.058598,0.201834,0.034949,0.066590
3,TCGA-06-0171-02A-11,0.946300,0.892850,0.316700,0.468883,0.199371,0.608267,0.051528,0.343695,0.987614,...,0.090471,0.110574,0.018514,0.059065,0.084249,0.030588,0.084550,0.123179,0.024192,0.048582
4,TCGA-06-0190-01A-01,0.900578,0.913929,0.039247,0.459649,0.385464,0.563823,0.256340,0.620945,0.960253,...,0.061767,0.136060,0.022281,0.045415,0.076932,0.023218,0.069550,0.169489,0.033715,0.060410
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,0.914987,0.944935,0.453672,0.498965,0.369733,0.650784,0.029887,0.702710,0.983409,...,0.044978,0.256822,0.090731,0.177181,0.083665,0.098730,0.101680,0.369930,0.137996,0.282453
60,TCGA-76-4928-01B-01,0.809969,0.934087,0.015485,0.559392,0.378124,0.573251,0.021810,0.563863,0.981317,...,0.023597,0.326470,0.043576,0.113330,0.072018,0.078825,0.063560,0.248358,0.051274,0.166855
61,TCGA-76-4929-01A-01,0.899810,0.915504,0.014822,0.764733,0.036500,0.588300,0.023903,0.747113,0.918559,...,0.051385,0.256155,0.062192,0.132334,0.089363,0.179526,0.146808,0.280876,0.157513,0.248242
62,TCGA-76-4931-01A-01,0.846634,0.925314,0.917137,0.600138,0.613448,0.647793,0.019436,0.912443,0.942751,...,0.035018,0.261144,0.050211,0.050398,0.076935,0.117766,0.064936,0.357379,0.050092,0.217461


#### Loading the 'methylation_data_M_transformed_final.csv' file into this notebook

In [8]:
# Loading the file 'methylation_data_M_transformed_final.csv'.
methylation_data_M_transformed = pd.read_csv(data_directory_final_datasets + '/methylation_data_M_transformed_final.csv')

print("The 'methylation_data_M_transformed' DataFrame:")
methylation_data_M_transformed

The 'methylation_data_M_transformed' DataFrame:


Unnamed: 0,Samples,cg00000957,cg00001349,cg00001583,cg00002837,cg00003287,cg00004121,cg00008647,cg00009292,cg00011717,...,ch.22.28920330F,ch.22.436090R,ch.22.441164F,ch.22.528917R,ch.22.569473R,ch.22.707049R,ch.22.728807R,ch.22.734399R,ch.22.772318F,ch.22.909671F
0,TCGA-06-0125-01A-01,3.132755,3.960518,-5.452737,2.348422,0.642046,0.425173,-3.568310,0.090094,4.677447,...,-4.328807,-1.604700,-5.391144,-4.299671,-3.299155,-5.006189,-4.483695,-1.512707,-4.935511,-4.526937
1,TCGA-06-0125-02A-11,3.196057,3.825019,-5.503606,1.372434,0.849407,0.629880,-3.292764,1.242929,5.700119,...,-2.854379,-1.720475,-5.169584,-4.430285,-3.100187,-4.953307,-3.404266,-1.763778,-4.931599,-3.668569
2,TCGA-06-0152-02A-01,4.057813,3.626717,1.146710,-0.208610,0.059986,0.788350,-0.941862,1.416831,5.731835,...,-4.614915,-2.221432,-5.487836,-4.947837,-2.324764,-4.632719,-4.005875,-1.983516,-4.787276,-3.809136
3,TCGA-06-0171-02A-11,4.139295,3.058785,-1.109405,-0.179801,-2.005682,0.634832,-4.202176,-0.933237,6.317184,...,-3.329587,-3.007868,-5.728251,-3.993713,-3.442227,-4.986055,-3.436608,-2.831527,-5.334004,-4.291577
4,TCGA-06-0190-01A-01,3.179215,3.408476,-4.613496,-0.233366,-0.672902,0.370326,-1.536587,0.712060,4.594485,...,-3.925038,-2.666686,-5.455511,-4.393648,-3.584784,-5.394690,-3.741805,-2.292808,-4.840999,-3.959180
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,3.427987,4.101014,-0.268118,-0.005975,-0.769480,0.898056,-5.020576,1.241059,5.889282,...,-4.408250,-1.532935,-3.325032,-2.215352,-3.453181,-3.190405,-3.143200,-0.768262,-2.643066,-1.345063
60,TCGA-76-4928-01B-01,2.091628,3.824918,-5.990499,0.344366,-0.717767,0.425782,-5.487022,0.370565,5.714939,...,-5.370823,-1.044793,-4.456032,-2.967873,-3.687673,-3.546754,-3.881002,-1.597628,-4.209691,-2.319970
61,TCGA-76-4929-01A-01,3.166884,3.437609,-6.054600,1.700659,-4.722319,0.514961,-5.351739,1.562834,3.495553,...,-4.206406,-1.537983,-3.914480,-2.712960,-3.349122,-2.192265,-2.538937,-1.356304,-2.419188,-1.598524
62,TCGA-76-4931-01A-01,2.464759,3.631037,3.468339,0.585791,0.666281,0.879107,-5.656808,3.381448,4.041543,...,-4.784318,-1.500444,-4.241542,-4.235881,-3.584725,-2.905236,-3.847968,-0.846516,-4.245142,-1.847410
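The previewed values are consistent with the standard M-value definition, M = log2(beta / (1 - beta)), applied to the regular methylation (beta) values. The actual transformation is defined in the preprocessing notebooks and is not shown here, so the sketch below is only a check under that assumption, using the first three values of sample 'TCGA-06-0125-01A-01' from the two previews.

```python
import numpy as np


def beta_to_m(beta):
    """Convert beta values in (0, 1) to M-values: log2(beta / (1 - beta))."""
    beta = np.asarray(beta, dtype=float)
    return np.log2(beta / (1.0 - beta))


# Beta values of cg00000957, cg00001349 and cg00001583 for the first sample.
beta_values = np.array([0.897657, 0.939643, 0.022323])

# Corresponding values shown in the M-transformed preview.
previewed_m_values = np.array([3.132755, 3.960518, -5.452737])

m_values = beta_to_m(beta_values)
print(m_values)
print(np.allclose(m_values, previewed_m_values, atol=1e-3))
```

The tolerance accounts for the beta values being rounded to six decimals in the preview.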


### Loading the Gene Expression Data Files

Within this subsection, all the gene expression data files from the directory 'data_directory_final_datasets' are loaded into this notebook.

#### Loading the 'gene_expression_data_final.csv' file into this notebook

In [9]:
# Loading the file 'gene_expression_data_final.csv'.
gene_expression_data = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_final.csv')

print("The 'gene_expression_data' DataFrame:")
gene_expression_data

The 'gene_expression_data' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,97.4399,6.5428,5.9849,7.0651,16.9301,63.7275,14.5107,24.2783,0.9795,...,9.1356,15.5325,5.4792,1.9232,7.2505,5.5178,0.7025,0.0000,12.7288,4.1649
1,TCGA-06-0125-02A-11,63.4521,5.4929,3.1369,16.4290,16.4273,66.9871,22.8885,18.1955,1.4115,...,6.0041,8.3656,4.2820,1.6080,2.8211,2.1185,0.6252,1.4481,15.1055,2.2084
2,TCGA-06-0152-02A-01,98.1366,6.3809,5.7963,21.9912,19.3073,73.1009,9.4840,24.5692,1.3706,...,9.1559,10.9328,4.6383,1.7929,6.1256,3.5844,2.2429,0.0000,12.0011,3.3086
3,TCGA-06-0171-02A-11,55.5088,5.0426,2.7663,65.6843,50.7959,65.8725,20.9938,22.3543,4.6538,...,3.5008,3.8793,2.8926,0.9067,2.7872,3.3664,0.7014,0.0000,7.6533,0.5195
4,TCGA-06-0190-01A-01,80.6678,4.8642,6.9529,25.5317,24.9163,101.1771,17.4425,28.6862,5.4792,...,4.4420,7.2659,3.2201,0.9027,4.5347,2.6690,0.1727,0.0000,6.0973,3.0503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,64.4713,5.4887,4.6042,11.1335,13.0230,79.9233,16.0202,25.2603,8.1108,...,11.8293,8.2600,2.7040,1.2529,3.3570,1.2770,1.4806,0.0000,8.0572,5.0802
60,TCGA-76-4928-01B-01,89.2847,4.2106,3.8754,15.9864,6.9736,84.9414,9.9544,19.4110,3.5411,...,7.1290,4.8864,2.6234,0.3575,4.1562,2.0797,0.4309,0.0000,6.9073,1.6309
61,TCGA-76-4929-01A-01,90.8538,7.6404,2.3827,23.5793,10.0796,29.7248,18.8568,47.1435,5.5205,...,9.7622,6.5039,4.7981,3.0880,4.4396,2.8821,3.6902,0.0000,9.9713,8.2106
62,TCGA-76-4931-01A-01,84.3173,6.6323,5.9630,10.5260,9.3104,28.4352,16.0662,27.2980,3.2969,...,9.5540,12.3694,8.1206,1.9674,8.9614,9.1919,0.9034,0.0000,13.6745,6.0827


#### Loading the 'gene_expression_data_log2_transformed_final.csv' file into this notebook

In [10]:
# Loading the file 'gene_expression_data_log2_transformed_final.csv'.
gene_expression_data_log2_transformed = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_log2_transformed_final.csv')

print("The 'gene_expression_data_log2_transformed' DataFrame:")
gene_expression_data_log2_transformed

The 'gene_expression_data_log2_transformed' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,6.621171,2.915100,2.804239,3.011692,4.164312,6.016307,3.955192,4.659828,0.985136,...,3.341360,4.047233,2.695816,1.547549,3.044482,2.704385,0.767655,0.000000,3.779134,2.368740
1,TCGA-06-0125-02A-11,6.010155,2.698863,2.048550,4.123418,4.123277,6.087189,4.578244,4.262696,1.269931,...,2.808200,3.227371,2.401084,1.382944,1.933988,1.640852,0.700617,1.291662,4.009482,1.681854
2,TCGA-06-0152-02A-01,6.631346,2.883797,2.764750,4.523010,4.343927,6.211419,3.390117,4.676335,1.245252,...,3.344246,3.576861,2.495260,1.481764,2.833011,2.196733,1.697285,0.000000,3.700562,2.107219
3,TCGA-06-0171-02A-11,5.820404,2.595169,1.913148,6.059275,5.694766,6.063341,4.459025,4.545616,2.499221,...,2.170181,2.286674,1.960734,0.931078,1.921132,2.126444,0.766722,0.000000,3.113250,0.603597
4,TCGA-06-0190-01A-01,6.351695,2.551934,2.991481,4.729645,4.695788,6.674928,4.204962,4.891721,2.695816,...,2.444137,3.047172,2.077277,0.928048,2.468505,1.875387,0.229834,0.000000,2.827270,2.018029
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,6.032791,2.697929,2.486508,3.600924,3.809723,6.338483,4.089176,4.714811,3.187578,...,3.681371,3.211012,1.889084,1.171783,2.123335,1.187134,1.310689,0.000000,3.179065,2.604119
60,TCGA-76-4928-01B-01,6.496410,2.381450,2.285521,4.086308,2.995231,6.425281,3.453439,4.351275,2.183042,...,3.023078,2.557386,1.857344,0.440952,2.366308,1.622790,0.516923,0.000000,2.983185,1.395556
61,TCGA-76-4929-01A-01,6.521268,3.111098,1.758175,4.619372,3.469834,4.941332,4.311561,5.589269,2.704983,...,3.427901,2.907641,2.535580,2.031395,2.443501,1.956837,2.229649,0.000000,3.455663,3.203295
62,TCGA-76-4931-01A-01,6.414766,2.932118,2.799709,3.526820,3.366028,4.879471,4.093070,4.822628,2.103296,...,3.399718,3.740863,3.189129,1.569199,3.316349,3.349351,0.928579,0.000000,3.875239,2.824299
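The previewed values are consistent with a log2(x + 1) transformation of the regular gene expression values, which also maps an expression of 0 to 0. The transformation itself is defined in the preprocessing notebooks, so the sketch below is only a check under that assumption, using values of sample 'TCGA-06-0125-01A-01' from the two previews.

```python
import numpy as np


def log2_plus_one(expression):
    """Apply log2(x + 1) element-wise to non-negative expression values."""
    expression = np.asarray(expression, dtype=float)
    return np.log2(expression + 1.0)


# Regular values of ENSG00000000419, ENSG00000000457, ENSG00000000460
# and ENSG00000288667 for the first sample.
regular_values = np.array([97.4399, 6.5428, 5.9849, 0.0])

# Corresponding values shown in the log2-transformed preview.
previewed_log2_values = np.array([6.621171, 2.915100, 2.804239, 0.0])

log2_values = log2_plus_one(regular_values)
print(log2_values)
print(np.allclose(log2_values, previewed_log2_values, atol=1e-4))
```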


#### Loading the 'gene_expression_data_normalized_split1.csv' file into this notebook

In [11]:
# Loading the file 'gene_expression_data_normalized_split1.csv'.
gene_expression_data_normalized_split1 = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_normalized_split1.csv')

print("The 'gene_expression_data_normalized_split1' DataFrame:")
gene_expression_data_normalized_split1

The 'gene_expression_data_normalized_split1' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,92.395923,5.592885,5.152120,6.034660,14.265919,58.189679,12.180719,20.777148,0.878850,...,7.823031,13.033977,4.710858,1.667219,6.229808,4.751248,0.664158,0.000000,10.717804,3.577775
1,TCGA-06-0125-02A-11,61.272477,5.791354,3.420497,16.059698,16.051794,64.588040,22.089260,17.637084,1.577971,...,6.306842,8.570617,4.595219,1.824963,3.108812,2.417575,0.655131,1.631946,14.850537,2.516827
2,TCGA-06-0152-02A-01,90.189492,5.357808,4.882085,18.506898,16.181287,66.043798,7.981333,20.677194,1.063256,...,7.712098,9.150985,3.925498,1.427671,5.152810,3.012615,1.835002,0.000000,10.011069,2.766775
3,TCGA-06-0171-02A-11,63.271015,5.910127,3.338842,75.564758,57.038738,75.831121,23.113773,24.756669,5.461121,...,4.158667,4.608862,3.493456,1.042963,3.356892,4.004025,0.788238,0.000000,8.670717,0.556231
4,TCGA-06-0190-01A-01,81.748792,4.876040,6.987848,25.713692,25.161404,101.743177,17.600723,28.948329,5.487485,...,4.406533,7.335000,3.161388,0.777585,4.514737,2.570310,0.125754,0.000000,6.062337,2.971911
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,59.765756,5.090363,4.328779,10.477358,12.293281,73.834217,15.231383,23.779848,7.566715,...,11.092560,7.716329,2.489931,1.150373,3.123292,1.176077,1.363433,0.066244,7.505877,4.737979
60,TCGA-76-4928-01B-01,92.951767,5.091652,4.690288,18.012963,8.035187,87.601704,11.373797,21.694108,4.307865,...,8.179704,5.828337,3.263569,0.479171,5.027802,2.592683,0.571456,0.000000,7.969385,2.042146
61,TCGA-76-4929-01A-01,88.382979,6.821756,2.113394,21.992348,9.127008,28.134000,17.481229,45.414854,4.872065,...,8.863875,5.783933,4.250063,2.760856,3.935242,2.572373,3.300958,0.000000,9.051967,7.342150
62,TCGA-76-4931-01A-01,72.117629,5.256129,4.663081,8.389902,7.401098,23.442254,13.051358,22.585619,2.454860,...,7.595277,9.928027,6.453398,1.412165,7.079771,7.316944,0.651940,0.000000,11.024108,4.746487


#### Loading the 'gene_expression_data_normalized_split2.csv' file into this notebook

In [12]:
# Loading the file 'gene_expression_data_normalized_split2.csv'.
gene_expression_data_normalized_split2 = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_normalized_split2.csv')

print("The 'gene_expression_data_normalized_split2' DataFrame:")
gene_expression_data_normalized_split2

The 'gene_expression_data_normalized_split2' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,90.864152,5.469377,5.045302,5.897977,13.955179,57.314642,11.916056,20.329408,0.874454,...,7.641990,12.758038,4.615098,1.645458,6.086062,4.653973,0.660283,0.000000,10.478610,3.514196
1,TCGA-06-0125-02A-11,60.347152,5.662601,3.358621,15.702106,15.695502,63.585004,21.622683,17.251348,1.557433,...,6.160573,8.370938,4.503850,1.803462,3.058060,2.384606,0.651856,1.608733,14.517767,2.481023
2,TCGA-06-0152-02A-01,88.644858,5.243674,4.781198,18.121496,15.817519,65.031250,7.800521,20.235117,1.053631,...,7.533165,8.943212,3.851741,1.411979,5.046850,2.963507,1.811792,0.000000,9.789221,2.725384
3,TCGA-06-0171-02A-11,62.299154,5.777435,3.281008,74.310862,56.161508,74.576260,22.637992,24.260388,5.342046,...,4.078760,4.517073,3.431348,1.033673,3.298019,3.930485,0.783915,0.000000,8.468324,0.553294
4,TCGA-06-0190-01A-01,80.379581,4.775944,6.827890,25.210287,24.661865,100.225946,17.218688,28.386675,5.368348,...,4.319033,7.164058,3.109604,0.773356,4.424054,2.533069,0.125440,0.000000,5.924823,2.925026
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,58.851306,4.985565,4.243767,10.243752,12.029308,72.665552,14.892681,23.305412,7.393108,...,10.847371,7.537267,2.455956,1.139398,3.072727,1.162737,1.347795,0.000000,7.332262,4.640681
60,TCGA-76-4928-01B-01,91.443421,4.986846,4.595260,17.632340,7.850472,86.085642,11.123934,21.230344,4.223285,...,7.990271,5.697481,3.208083,0.475969,4.924550,2.554606,0.568646,0.000000,7.788627,2.015817
61,TCGA-76-4929-01A-01,86.842225,6.664465,2.087512,21.523338,8.920004,27.584235,17.100992,44.657579,4.771646,...,8.661808,5.653684,4.167679,2.719148,3.861135,2.535503,3.244425,0.044547,8.846131,7.173619
62,TCGA-76-4931-01A-01,70.958912,5.145937,4.569608,8.193619,7.229725,22.967488,12.772771,22.119715,2.422013,...,7.419556,9.706644,6.304706,1.395977,6.914371,7.147723,0.648331,0.000000,10.778983,4.649444


#### Loading the 'gene_expression_data_normalized_split3.csv' file into this notebook

In [13]:
# Loading the file 'gene_expression_data_normalized_split3.csv'.
gene_expression_data_normalized_split3 = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_normalized_split3.csv')

print("The 'gene_expression_data_normalized_split3' DataFrame:")
gene_expression_data_normalized_split3

The 'gene_expression_data_normalized_split3' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,92.643479,5.554490,5.117994,5.990969,14.099735,58.171740,12.042321,20.552102,0.881812,...,7.754169,12.887973,4.681129,1.668446,6.182529,4.720377,0.666077,0.000000,10.594535,3.563727
1,TCGA-06-0125-02A-11,61.285881,5.750553,3.407871,15.876535,15.870181,64.595708,21.841648,17.437969,1.578327,...,6.259142,8.482771,4.568890,1.827740,3.102473,2.417473,0.657050,1.630729,14.676342,2.514120
2,TCGA-06-0152-02A-01,90.425298,5.321987,4.850169,18.306354,15.993129,66.055483,7.910198,20.454769,1.064688,...,7.644098,9.050790,3.906700,1.429371,5.119667,3.006783,1.836279,0.000000,9.903071,2.764312
3,TCGA-06-0171-02A-11,63.260348,5.868942,3.326735,75.610658,57.004554,75.882769,22.867354,24.505625,5.424283,...,4.137190,4.581996,3.480506,1.044606,3.344681,3.985127,0.790502,0.000000,8.578374,0.557356
4,TCGA-06-0190-01A-01,81.867710,4.844560,6.932683,25.451563,24.904531,102.168133,17.403659,28.642858,5.452125,...,4.381225,7.272477,3.155092,0.779692,4.489119,2.567902,0.126567,0.000000,6.017823,2.966700
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,59.762069,5.057619,4.304779,10.355394,12.155594,73.905440,15.052058,23.537542,7.504148,...,10.966402,7.648248,2.489392,1.151658,3.116690,1.176156,1.365236,0.000000,7.444206,4.707060
60,TCGA-76-4928-01B-01,93.206735,5.059146,4.660581,17.817292,7.959573,87.775473,11.245223,21.447854,4.283869,...,8.102115,5.788087,3.253177,0.479085,4.994944,2.591031,0.572585,0.042842,7.898844,2.042521
61,TCGA-76-4929-01A-01,88.549940,6.768592,2.115298,21.742292,9.028700,27.835531,17.283142,45.209773,4.840271,...,8.772669,5.742533,4.227221,2.758069,3.915946,2.569969,3.289715,0.000000,8.955750,7.280938
62,TCGA-76-4931-01A-01,72.180933,5.221285,4.634294,8.303963,7.340183,23.200921,12.903740,22.342504,2.455167,...,7.530298,9.820481,6.402781,1.413229,7.021733,7.254800,0.653629,0.000000,10.896767,4.715715


#### Loading the 'gene_expression_data_normalized_split4.csv' file into this notebook

In [14]:
# Loading the file 'gene_expression_data_normalized_split4.csv'.
gene_expression_data_normalized_split4 = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_normalized_split4.csv')

print("The 'gene_expression_data_normalized_split4' DataFrame:")
gene_expression_data_normalized_split4

The 'gene_expression_data_normalized_split4' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,93.574333,5.532442,5.093881,5.975663,14.329223,59.039677,12.218129,20.923542,0.852044,...,7.786423,13.084544,4.650765,1.628127,6.171881,4.690996,0.639925,0.000000,10.717237,3.522271
1,TCGA-06-0125-02A-11,62.196615,5.732521,3.364135,16.152073,16.144910,65.568017,22.247096,17.741108,1.539275,...,6.249106,8.541494,4.534698,1.785113,3.057648,2.371975,0.631365,1.591275,14.919804,2.468817
2,TCGA-06-0152-02A-01,91.334846,5.299294,4.822523,18.628483,16.270983,67.026219,7.949135,20.823546,1.031788,...,7.673102,9.130975,3.866931,1.391685,5.095310,2.961906,1.793615,0.000000,10.004590,2.719123
3,TCGA-06-0171-02A-11,64.210465,5.853896,3.284421,76.559233,57.870812,76.827281,23.299737,24.976412,5.401850,...,4.099758,4.548788,3.437998,1.012115,3.302277,3.946356,0.762180,0.000000,8.641816,0.533869
4,TCGA-06-0190-01A-01,82.854429,4.817163,6.943460,25.954552,25.388750,103.069175,17.704606,29.245569,5.429298,...,4.348679,7.292652,3.110660,0.751435,4.453808,2.524169,0.121940,0.021260,6.003435,2.921890
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,60.639400,5.033192,4.270000,10.471302,12.332592,74.839873,15.311202,23.986648,7.527985,...,11.105048,7.677369,2.443208,1.116577,3.071817,1.141263,1.327585,0.000000,7.466948,4.677210
60,TCGA-76-4928-01B-01,94.123215,5.034619,4.630340,18.127719,8.001357,88.706937,11.390432,21.840988,4.248931,...,8.146598,5.768685,3.209433,0.457931,4.968760,2.545229,0.548931,0.000000,7.936256,1.999967
61,TCGA-76-4929-01A-01,89.505375,6.776044,2.070583,22.145654,9.107313,28.406733,17.588569,46.039544,4.812681,...,8.841323,5.724377,4.192513,2.713033,3.876290,2.524738,3.246783,0.000000,9.032115,7.300965
62,TCGA-76-4931-01A-01,73.078031,5.197560,4.603517,8.357360,7.359781,23.639825,13.101275,22.760025,2.409167,...,7.555987,9.919123,6.398133,1.375717,7.036244,7.275121,0.627956,0.000000,11.031585,4.686204
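The four near-identical loading cells above could also be written as a loop over the split numbers, returning one DataFrame per split. The sketch below assumes the file naming pattern '<prefix>_split<k>.csv' seen in the file names above; the helper name `load_per_split_files` is hypothetical, and the demonstration writes tiny placeholder files to a temporary directory instead of reading the thesis data.

```python
import os
import tempfile

import pandas as pd


def load_per_split_files(directory, prefix, n_splits=4):
    """Return {split number: DataFrame} for files named '<prefix>_split<k>.csv'."""
    return {
        k: pd.read_csv(os.path.join(directory, f"{prefix}_split{k}.csv"))
        for k in range(1, n_splits + 1)
    }


# Demonstration with placeholder files (not the real data).
with tempfile.TemporaryDirectory() as temporary_directory:
    for k in range(1, 5):
        placeholder = pd.DataFrame({"Samples": ["sample_a"], "gene_x": [float(k)]})
        file_name = f"gene_expression_data_normalized_split{k}.csv"
        placeholder.to_csv(os.path.join(temporary_directory, file_name), index=False)

    normalized_by_split = load_per_split_files(
        temporary_directory, "gene_expression_data_normalized"
    )

print(sorted(normalized_by_split))
print(normalized_by_split[3])
```

Keeping the per-split DataFrames in a dictionary keyed by split number makes it straightforward to pick the matching file inside a cross-validation loop.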


#### Loading the 'gene_expression_data_log2_transformed_normalized_split1.csv' file into this notebook

In [15]:
# Loading the file 'gene_expression_data_log2_transformed_normalized_split1.csv'.
gene_expression_data_log2_transformed_normalized_split1 = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_log2_transformed_normalized_split1.csv')

print("The 'gene_expression_data_log2_transformed_normalized_split1' DataFrame:")
gene_expression_data_log2_transformed_normalized_split1

The 'gene_expression_data_log2_transformed_normalized_split1' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,6.533349,2.704108,2.604151,2.797547,3.914457,5.874208,3.702543,4.427246,0.897246,...,3.124324,3.792986,2.497078,1.399378,2.836973,2.507082,0.723615,0.000000,3.533081,2.178071
1,TCGA-06-0125-02A-11,5.947611,2.746882,2.127599,4.074776,4.074104,6.022565,4.511751,4.202261,1.350551,...,2.852190,3.241371,2.467472,1.482815,2.022154,1.755953,0.716025,1.380275,3.968692,1.797769
2,TCGA-06-0152-02A-01,6.498897,2.651714,2.539539,4.268359,4.085031,6.054255,3.149954,4.420587,1.030658,...,3.106052,3.326178,2.283841,1.264227,2.604475,1.987989,1.487125,0.000000,3.443437,1.896848
3,TCGA-06-0171-02A-11,5.993315,2.771783,2.100627,6.246146,5.845655,6.251154,4.574760,4.670252,2.674999,...,2.350273,2.470969,2.151988,1.016616,2.106779,2.306761,0.826050,0.000000,3.256354,0.627584
4,TCGA-06-0190-01A-01,6.358379,2.538073,2.980774,4.723047,4.692796,6.671121,4.199602,4.888434,2.680896,...,2.418074,3.042082,2.040503,0.817751,2.446512,1.819542,0.167427,0.000000,2.803203,1.973243
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,5.912184,2.589768,2.397147,3.503163,3.714812,6.213226,4.002927,4.614260,3.081541,...,3.578381,3.106755,1.786672,1.090049,2.027217,1.106408,1.225726,0.090607,3.071429,2.503711
60,TCGA-76-4928-01B-01,6.541902,2.590073,2.491692,4.231344,3.158638,6.457320,3.611531,4.486742,2.391481,...,3.181489,2.754749,2.075494,0.555282,2.574848,1.828505,0.641716,0.000000,3.148041,1.588611
61,TCGA-76-4929-01A-01,6.469988,2.950512,1.621953,4.505624,3.322793,4.848625,4.190256,5.522366,2.537083,...,3.284888,2.745305,2.375791,1.894583,2.286698,1.820384,2.088083,0.000000,3.312110,3.043292
62,TCGA-76-4931-01A-01,6.179462,2.628447,2.484797,3.214033,3.053456,4.594336,3.794749,4.542674,1.772077,...,3.086528,3.432486,2.880973,1.255027,2.997202,3.038967,0.713106,0.000000,3.570206,2.505861


#### Loading the 'gene_expression_data_log2_transformed_normalized_split2.csv' file into this notebook

In [16]:
# Loading the file 'gene_expression_data_log2_transformed_normalized_split2.csv'.
gene_expression_data_log2_transformed_normalized_split2 = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_log2_transformed_normalized_split2.csv')

print("The 'gene_expression_data_log2_transformed_normalized_split2' DataFrame:")
gene_expression_data_log2_transformed_normalized_split2

The 'gene_expression_data_log2_transformed_normalized_split2' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,6.511864,2.679318,2.581366,2.771958,3.888402,5.855118,3.677011,4.400996,0.894186,...,3.097648,3.768090,2.474682,1.388681,2.810776,2.484637,0.720787,0.000000,3.506873,2.159561
1,TCGA-06-0125-02A-11,5.928326,2.721814,2.109031,4.047999,4.047426,6.002645,4.485978,4.175872,1.340136,...,2.825863,3.214334,2.445805,1.472162,2.005694,1.743758,0.713504,1.368547,3.941764,1.784290
2,TCGA-06-0152-02A-01,6.476629,2.627834,2.516815,4.243139,4.057883,6.034621,3.123868,4.394586,1.024919,...,3.079130,3.299840,2.263810,1.255869,2.581739,1.971745,1.476586,0.000000,3.417596,1.881975
3,TCGA-06-0171-02A-11,5.973575,2.746529,2.083020,6.224662,5.826103,6.229740,4.549524,4.645610,2.650607,...,2.329748,2.449261,2.132852,1.010983,2.088704,2.286937,0.823425,0.000000,3.229273,0.625489
4,TCGA-06-0190-01A-01,6.336626,2.515502,2.954740,4.698991,4.668409,6.651824,4.173289,4.864280,2.656611,...,2.396489,3.015388,2.024004,0.814957,2.424570,1.805699,0.167112,0.000000,2.777578,1.957507
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,5.892651,2.567032,2.375840,3.477029,3.689634,6.192767,3.976254,4.589850,3.055413,...,3.552491,3.080011,1.773842,1.083575,2.010895,1.099150,1.217097,0.000000,3.044903,2.481211
60,TCGA-76-4928-01B-01,6.520915,2.567341,2.469549,4.205746,3.132073,6.434871,3.585815,4.460673,2.370191,...,3.154630,2.729385,2.058154,0.552583,2.552173,1.814443,0.639547,0.000000,3.121911,1.577341
61,TCGA-76-4929-01A-01,6.447379,2.924254,1.611203,4.479612,3.296470,4.824371,4.163912,5.501339,2.514256,...,3.258467,2.720053,2.354788,1.879758,2.266479,1.806854,2.070556,0.000000,3.285713,3.017078
62,TCGA-76-4931-01A-01,6.158757,2.605182,2.462943,3.186892,3.026966,4.569569,3.769634,4.517452,1.759587,...,3.059955,3.406611,2.854744,1.245002,2.970528,3.012510,0.710238,0.000000,3.544117,2.483465


#### Loading the 'gene_expression_data_log2_transformed_normalized_split3.csv' file into this notebook

In [17]:
# Loading the file 'gene_expression_data_log2_transformed_normalized_split3.csv'.
gene_expression_data_log2_transformed_normalized_split3 = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_log2_transformed_normalized_split3.csv')

print("The 'gene_expression_data_log2_transformed_normalized_split3' DataFrame:")
gene_expression_data_log2_transformed_normalized_split3

The 'gene_expression_data_log2_transformed_normalized_split3' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,6.538236,2.698971,2.599515,2.791829,3.902005,5.875722,3.690631,4.415416,0.900762,...,3.116263,3.781247,2.492536,1.402112,2.830874,2.502432,0.726817,0.000000,3.521085,2.176641
1,TCGA-06-0125-02A-11,5.949717,2.741451,2.126602,4.062450,4.061910,6.024425,4.499331,4.190123,1.352745,...,2.846122,3.231279,2.463748,1.485591,2.022770,1.758786,0.719077,1.381590,3.955966,1.799139
2,TCGA-06-0152-02A-01,6.503689,2.646871,2.534861,4.256732,4.072383,6.056166,3.141715,4.408880,1.033438,...,3.098065,3.315100,2.281261,1.267075,2.599919,1.988661,1.489906,0.000000,3.432374,1.898638
3,TCGA-06-0171-02A-11,5.994796,2.766564,2.099773,6.248379,5.846833,6.253518,4.562839,4.659031,2.670029,...,2.347357,2.467109,2.150123,1.019489,2.105698,2.304090,0.829654,0.000000,3.245761,0.630242
4,TCGA-06-0190-01A-01,6.361609,2.533489,2.974154,4.711681,4.681456,6.678077,4.187459,4.876360,2.676290,...,2.414324,3.034606,2.041272,0.821011,2.442862,1.821125,0.168975,0.000000,2.797359,1.974146
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,5.913918,2.585209,2.393659,3.491000,3.703090,6.215961,3.990141,4.603042,3.074534,...,3.566548,3.098757,1.788885,1.092618,2.027787,1.108884,1.228629,0.000000,3.064288,2.499069
60,TCGA-76-4928-01B-01,6.546914,2.585744,2.487293,4.219639,3.149682,6.461304,3.599773,4.474169,2.387980,...,3.172608,2.749517,2.074978,0.556544,2.570160,1.830410,0.644171,0.000000,3.139867,1.591197
61,TCGA-76-4929-01A-01,6.473820,2.943987,1.625106,4.492971,3.311922,4.836512,4.177936,5.518313,2.532405,...,3.274664,2.739739,2.372452,1.896250,2.283986,1.821973,2.087318,0.000000,3.301426,3.036063
62,TCGA-76-4931-01A-01,6.182259,2.623624,2.480576,3.203926,3.046369,4.582992,3.782863,4.530744,1.774627,...,3.078956,3.421380,2.874465,1.257471,2.990208,3.031526,0.716155,0.000000,3.558147,2.501235


#### Loading the 'gene_expression_data_log2_transformed_normalized_split4.csv' file into this notebook

In [18]:
# Loading the file 'gene_expression_data_log2_transformed_normalized_split4.csv'.
gene_expression_data_log2_transformed_normalized_split4 = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_log2_transformed_normalized_split4.csv')

print("The 'gene_expression_data_log2_transformed_normalized_split4' DataFrame:")
gene_expression_data_log2_transformed_normalized_split4

The 'gene_expression_data_log2_transformed_normalized_split4' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,6.555136,2.693338,2.592807,2.788086,3.925562,5.899415,3.711472,4.442358,0.876780,...,3.121844,3.803232,2.483713,1.378874,2.828176,2.493979,0.703143,0.000000,3.537585,2.162162
1,TCGA-06-0125-02A-11,5.973447,2.736845,2.110939,4.087785,4.087185,6.048534,4.527025,4.215685,1.329444,...,2.843601,3.240901,2.453826,1.462444,2.005731,1.738252,0.695703,1.358539,3.980111,1.779118
2,TCGA-06-0152-02A-01,6.520622,2.640733,2.527011,4.282614,4.097761,6.079725,3.148420,4.435736,1.009217,...,3.103049,3.327358,2.268412,1.243232,2.593147,1.971255,1.466806,0.000000,3.446927,1.879817
3,TCGA-06-0171-02A-11,6.018729,2.762677,2.084327,6.269014,5.870879,6.273959,4.591100,4.687715,2.664114,...,2.335786,2.457480,2.135102,0.995361,2.090285,2.291709,0.805732,0.000000,3.255973,0.607723
4,TCGA-06-0190-01A-01,6.381573,2.525837,2.975919,4.741242,4.710452,6.693015,4.212870,4.907849,2.670317,...,2.404381,3.038103,2.024493,0.797054,2.432248,1.801986,0.163160,0.000000,2.793840,1.956577
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,5.937405,2.578364,2.383113,3.506899,3.723988,6.236657,4.015223,4.631453,3.078630,...,3.584501,3.103757,1.768438,1.067833,2.010751,1.084422,1.204207,0.000000,3.068256,2.490470
60,TCGA-76-4928-01B-01,6.563483,2.578702,2.478472,4.245260,3.156874,6.478925,3.618168,4.501518,2.377346,...,3.180048,2.744635,2.058773,0.535252,2.562872,1.810554,0.621675,0.000000,3.146344,1.569393
61,TCGA-76-4929-01A-01,6.491748,2.945139,1.602981,4.520653,3.324016,4.867175,4.203871,5.546612,2.524563,...,3.285612,2.735101,2.361790,1.877470,2.271185,1.802216,2.071511,0.000000,3.313271,3.039531
62,TCGA-76-4931-01A-01,6.202613,2.617219,2.471596,3.212850,3.049750,4.611186,3.804912,4.558652,1.754077,...,3.083393,3.435620,2.873155,1.233559,2.992642,3.035043,0.692716,0.000000,3.575697,2.492748


## Linear Regression for Testing the Datasets

Within this section, a linear regression algorithm is applied to each of the 8 possible combinations of datasets with the goal of finding the combination that performs best. As mentioned before, we have collected 2 kinds of methylation data files and 4 kinds of gene expression data files, which are listed in the overview below:
* Regular methylation data file
* M-transformed methylation data file
* Regular gene expression data file
* Log2-transformed gene expression data file
* Normalized regular gene expression data files (this consists of multiple files as we normalized the gene expression data for each split)
* Normalized log2-transformed gene expression data files (this consists of multiple files as we normalized the gene expression data for each split)

For each combination of these kinds (where only one kind of methylation data file and one kind of gene expression data file can be considered at a time), linear regression models are built, one for each gene, and these are evaluated. This results in a total of 8 (4 x 2) evaluation measurements, one for each combination of the kinds of data. The best performing combination will be chosen and used in the notebooks where we apply different kinds of machine learning algorithms for both the Distance Analysis part and the CpG Site Analysis part. For each of these 8 combinations of datasets, a separate subsection is created below.

Since the purpose of this notebook is to compare the different combinations of kinds of data files and not to assess the results further, we can reduce the computational burden by building models only for the genes which are located on chromosome 1 and by selecting only the CpG sites which lie within a distance of 5,000,000 in both directions from the gene. These exact settings are used for all 8 experiments so that the results are directly comparable. The settings are defined below.

In [19]:
# Defining the chromosome setting once so that it is shared by all 8 experiments, which keeps the results directly
# comparable.
chromosome_experiments = 1

# Defining the distance setting once so that it is shared by all 8 experiments, which keeps the results directly
# comparable.
distance_experiments = 5000000
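
To make the role of these two settings concrete, the following sketch selects the CpG sites that lie within a given distance of a gene on the same chromosome. It is not the helper from 'Machine Learning Additional Functions.ipynb': the function name `select_cpg_sites_near_gene`, the tuple layouts, and the choice to measure the distance from the gene's start and end coordinates are all assumptions made for illustration.

```python
def select_cpg_sites_near_gene(cpg_sites, gene, distance):
    """Return the IDs of CpG sites within `distance` of the gene.

    cpg_sites: dict mapping a CpG site ID to (chromosome, position).
    gene: tuple (chromosome, start, end).
    Hypothetical helper; the window is [start - distance, end + distance].
    """
    gene_chromosome, gene_start, gene_end = gene
    window_start = max(0, gene_start - distance)
    window_end = gene_end + distance
    return [
        cpg_id
        for cpg_id, (chromosome, position) in cpg_sites.items()
        if chromosome == gene_chromosome and window_start <= position <= window_end
    ]


# Made-up coordinates, purely for illustration.
example_cpg_sites = {
    "cpg_a": (1, 1_200_000),
    "cpg_b": (1, 9_000_000),
    "cpg_c": (2, 1_250_000),
}
example_gene = (1, 1_000_000, 1_050_000)

print(select_cpg_sites_near_gene(example_cpg_sites, example_gene, distance=5_000_000))
```

Only `cpg_a` is selected here: `cpg_b` lies outside the window and `cpg_c` is on a different chromosome.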

Next follow the subsections, each of which contains the experiment for one of the 8 different combinations of datasets. For a clearer overview, these are the 8 combinations:
* The regular methylation data & regular gene expression data
* The regular methylation data & log2-transformed gene expression data
* The regular methylation data & normalized regular gene expression data
* The regular methylation data & normalized log2-transformed gene expression data
* The M-transformed methylation data & regular gene expression data
* The M-transformed methylation data & log2-transformed gene expression data
* The M-transformed methylation data & normalized regular gene expression data
* The M-transformed methylation data & normalized log2-transformed gene expression data
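
The eight combinations above are the Cartesian product of the two kinds of methylation data and the four kinds of gene expression data. A minimal sketch of that enumeration, using illustrative label strings rather than the notebook's actual DataFrame variables:

```python
from itertools import product

# Illustrative labels only; the notebook itself works with DataFrames.
methylation_kinds = ["regular", "M-transformed"]
gene_expression_kinds = [
    "regular",
    "log2-transformed",
    "normalized regular",
    "normalized log2-transformed",
]

# Every pairing of one methylation kind with one gene expression kind.
dataset_combinations = list(product(methylation_kinds, gene_expression_kinds))

for methylation_kind, gene_expression_kind in dataset_combinations:
    print(f"{methylation_kind} methylation & {gene_expression_kind} gene expression")
```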

The resulting R<sup>2</sup> values for each of the genes for each of the dataset combinations listed above will be visualized within a single box plot in the section 'Analyzing the Dataset Combinations Results'.
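
For reference, the R<sup>2</sup> score is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below re-implements that formula in plain Python (the function name `r2_from_scratch` and the example numbers are illustrative) to show that, on held-out data, the score can drop below zero when a model predicts worse than the mean of the observed values.

```python
def r2_from_scratch(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, for a non-constant y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]

# Perfect predictions give 1; predicting the mean everywhere gives 0.
print(r2_from_scratch(observed, observed))
print(r2_from_scratch(observed, [2.5, 2.5, 2.5, 2.5]))

# Predictions worse than the mean give a negative score.
print(r2_from_scratch(observed, [4.0, 3.0, 2.0, 1.0]))
```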

The first thing we can do is run the 'Machine Learning Additional Functions.ipynb' notebook present in the folder 'Machine Learning Algorithms', which contains additional helper functions for the machine learning algorithms, such as retrieving the methylation data within a certain distance from a gene. This notebook can be run by calling the command '%run' with the notebook as its argument.

In [20]:
# Running the notebook 'Machine Learning Additional Functions.ipynb' by calling the command '%run'.
%run "../Machine Learning Additional Functions.ipynb"

Starting the importing of the libraries...
Finishing the installing of the libraries.
The library 'pd' (pandas) has been loaded into the notebook with its version being:
1.2.4
The library 'np' (numpy) has been loaded into the notebook with its version being:
1.20.1

The library 're' has been loaded into the notebook with its version being:
2.2.1

The library 'sys' has been loaded into the notebook with its version being:
3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
The 'CpG_sites_location_data' DataFrame containing the location data of the CpG sites:
The 'genes_location_data' DataFrame containing the location data of the genes:
The 'chromosomes_length_data' DataFrame containing the lengths of the chromosomes:


#### Defining the functions needed to calculate the R2 scores for every dataset combination

Next, we can define the functions needed to calculate the R<sup>2</sup> scores so that we can call them for every dataset combination. There are two kinds of functions we need to define. The first one handles gene expression data which is not normalized and the second one handles gene expression data which is normalized. The reason for this distinction is that the normalized gene expression data consists of multiple files (one for each split), while the non-normalized gene expression data is just a single file. We can utilize the 'joblib' library to parallelize our code as the computations for a single gene do not influence the computations of any other gene.
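
As a minimal illustration of the `Parallel`/`delayed` pattern from 'joblib' used below: each `delayed(...)` call describes one independent task, and `Parallel` returns the results as a list in the order of the inputs. The function `square` is a stand-in for the per-gene computation, and `prefer="threads"` is chosen only to keep this sketch lightweight; the notebook's own call relies on joblib's default backend.

```python
from joblib import Parallel, delayed


def square(value):
    # Stand-in for an independent per-gene computation.
    return value * value


# n_jobs sets the number of concurrent workers; n_jobs=-1 would use all
# available CPU cores.
results = Parallel(n_jobs=2, prefer="threads")(delayed(square)(value) for value in range(5))
print(results)
```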

In [84]:
# This function calculates the R2 scores for a single 'gene' which is present in the 'gene_expression_data_chromosome1'
# DataFrame.
def calculate_R2_scores(gene, methylation_data, gene_expression_data_chromosome1): 
    
    # Retrieving the gene expression data of the current 'gene' and retrieving the methylation data that is within a 
    # distance of 'distance_experiments' from this gene by calling the function 'get_methylation_data_close_to_gene()' 
    # which is present in the notebook 'Machine Learning Additional Functions.ipynb'.
    gene_expression_data_chromosome1_current_gene = gene_expression_data_chromosome1[['Samples', gene]]
    methylation_data_close_to_current_gene = get_methylation_data_close_to_gene(methylation_data, gene, distance=distance_experiments)

    # Defining a list where all the R2 scores for the current gene will be stored. Since one model is built per fold 
    # (split), the list 'R2_scores_current_gene' will eventually contain 4 elements (R2 scores).
    R2_scores_current_gene = []
    
    # Looping over every column in the 'training_and_test_assignments' DataFrame such that 4-fold cross-validation is 
    # performed using the training and test sets defined within that 'training_and_test_assignments' DataFrame.
    for split in training_and_test_assignments.columns[1:]:

        # Retrieving the samples which belong to the training and test set for the current split 'split'.
        selected_samples_train = training_and_test_assignments.loc[training_and_test_assignments[split] == "TRAIN", 'Samples'].tolist()
        selected_samples_test = training_and_test_assignments.loc[training_and_test_assignments[split] == "TEST", 'Samples'].tolist()

        # Retrieving the gene expression and methylation data of which the samples belong to the training set.
        gene_expression_data_chromosome1_current_gene_train = gene_expression_data_chromosome1_current_gene.loc[gene_expression_data_chromosome1_current_gene['Samples'].isin(selected_samples_train)].drop(columns=['Samples'])
        methylation_data_close_to_current_gene_train = methylation_data_close_to_current_gene.loc[methylation_data_close_to_current_gene['Samples'].isin(selected_samples_train)].drop(columns=['Samples'])

        # Retrieving the gene expression and methylation data of which the samples belong to the test set.
        gene_expression_data_chromosome1_current_gene_test = gene_expression_data_chromosome1_current_gene.loc[gene_expression_data_chromosome1_current_gene['Samples'].isin(selected_samples_test)].drop(columns=['Samples'])
        methylation_data_close_to_current_gene_test = methylation_data_close_to_current_gene.loc[methylation_data_close_to_current_gene['Samples'].isin(selected_samples_test)].drop(columns=['Samples'])

        # Creating a new Linear Regression model by calling the constructor 'LinearRegression()' and calling the function 
        # 'fit()' to train the model with as X-data 'methylation_data_close_to_current_gene_train' and as Y-data 
        # 'gene_expression_data_chromosome1_current_gene_train'.
        model_current_gene = LinearRegression() 
        model_current_gene.fit(methylation_data_close_to_current_gene_train, gene_expression_data_chromosome1_current_gene_train)

        # Predicting the gene expression values based on the 'methylation_data_close_to_current_gene_test' by calling the 
        # function 'predict()'
        gene_expression_data_chromosome1_current_gene_predict = model_current_gene.predict(methylation_data_close_to_current_gene_test)

        # Calculating the R2 score by calling the function 'r2_score()' with the actual values 
        # 'gene_expression_data_chromosome1_current_gene_test' and the predicted values 
        # 'gene_expression_data_chromosome1_current_gene_predict'.
        R2_score = r2_score(gene_expression_data_chromosome1_current_gene_test, gene_expression_data_chromosome1_current_gene_predict)

        # Adding the 'R2_score' value to the 'R2_scores_current_gene' list by calling the function 'append()'.
        R2_scores_current_gene.append(R2_score)

    return np.mean(R2_scores_current_gene)


# This function retrieves the R2 scores for the linear regression models (one for each gene) fitted to predict the   
# 'gene_expression_data' based on the 'methylation_data'.
def linear_regression_for_datasets_combinations(methylation_data, gene_expression_data):
    
    # Retrieving the gene expression data present in the 'gene_expression_data' DataFrame of which the genes are present on 
    # chromosome 1 by calling the function 'get_gene_expression_data_from_chromosome()' present within the notebook 
    # 'Machine Learning Additional Functions.ipynb'.
    gene_expression_data_chromosome1 = get_gene_expression_data_from_chromosome(gene_expression_data, chromosome_experiments)
    genes = gene_expression_data_chromosome1.columns[1:]
    
    # Defining a list where all the R2 scores (one for each gene) will be stored such that we can later represent these
    # within a box plot to compare them with the R2 scores for the other experiments. This can be achieved by calling the
    # function 'calculate_R2_scores()' for each of the genes. Since the computations for a single gene do not influence 
    # the computations of any other gene, we can parallelize the execution of this function by calling the function 
    # 'Parallel()' from the 'joblib' library.
    R2_scores = Parallel(n_jobs=270)(delayed(calculate_R2_scores)(gene, methylation_data, gene_expression_data_chromosome1) for gene in genes)
    
    return R2_scores



In [None]:
# This function calculates the R2 scores for a single 'gene' which is present in the 'gene_expression_data_chromosome1'
# DataFrame.
def calculate_R2_scores_for_normalized_gene_expression(gene, methylation_data, gene_expression_data_split1, gene_expression_data_split2, gene_expression_data_split3, gene_expression_data_split4): 
    
    # Defining a list where all the R2 scores for the current gene will be stored. Since one model is built per fold 
    # (split), the list 'R2_scores_current_gene' will eventually contain 4 elements (R2 scores).
    R2_scores_current_gene = []
    
    # Looping over every column in the 'training_and_test_assignments' DataFrame such that 4-fold cross-validation is 
    # performed using the training and test sets defined within that 'training_and_test_assignments' DataFrame.
    for split in training_and_test_assignments.columns[1:]:
        
        # Retrieving which of the four normalized gene expression data splits to use.
        if split == "Split 1":
            gene_expression_data = gene_expression_data_split1
        elif split == "Split 2":
            gene_expression_data = gene_expression_data_split2
        elif split == "Split 3":
            gene_expression_data = gene_expression_data_split3
        elif split == "Split 4":
            gene_expression_data = gene_expression_data_split4

        # Retrieving the gene expression data present in the 'gene_expression_data' DataFrame of which the genes are 
        # present on chromosome 1 by calling the function 'get_gene_expression_data_from_chromosome()' present within 
        # the notebook 'Machine Learning Additional Functions.ipynb'.
        gene_expression_data_chromosome1 = get_gene_expression_data_from_chromosome(gene_expression_data, chromosome_experiments)

        # Retrieving the gene expression data of the current 'gene' and retrieving the methylation data that is within a 
        # distance of 'distance_experiments' from this gene by calling the function 'get_methylation_data_close_to_gene()' 
        # which is present in the notebook 'Machine Learning Additional Functions.ipynb'.
        gene_expression_data_chromosome1_current_gene = gene_expression_data_chromosome1[['Samples', gene]]
        methylation_data_close_to_current_gene = get_methylation_data_close_to_gene(methylation_data, gene, distance=distance_experiments)

        # Retrieving the samples which belong to the training and test set for the current split 'split'.
        selected_samples_train = training_and_test_assignments.loc[training_and_test_assignments[split] == "TRAIN", 'Samples'].tolist()
        selected_samples_test = training_and_test_assignments.loc[training_and_test_assignments[split] == "TEST", 'Samples'].tolist()

        # Retrieving the gene expression and methylation data of which the samples belong to the training set.
        gene_expression_data_chromosome1_current_gene_train = gene_expression_data_chromosome1_current_gene.loc[gene_expression_data_chromosome1_current_gene['Samples'].isin(selected_samples_train)].drop(columns=['Samples'])
        methylation_data_close_to_current_gene_train = methylation_data_close_to_current_gene.loc[methylation_data_close_to_current_gene['Samples'].isin(selected_samples_train)].drop(columns=['Samples'])

        # Retrieving the gene expression and methylation data of which the samples belong to the test set.
        gene_expression_data_chromosome1_current_gene_test = gene_expression_data_chromosome1_current_gene.loc[gene_expression_data_chromosome1_current_gene['Samples'].isin(selected_samples_test)].drop(columns=['Samples'])
        methylation_data_close_to_current_gene_test = methylation_data_close_to_current_gene.loc[methylation_data_close_to_current_gene['Samples'].isin(selected_samples_test)].drop(columns=['Samples'])

        # Creating a new Linear Regression model by calling the constructor 'LinearRegression()' and calling the function 
        # 'fit()' to train the model with as X-data 'methylation_data_close_to_current_gene_train' and as Y-data 
        # 'gene_expression_data_chromosome1_current_gene_train'.
        model_current_gene = LinearRegression() 
        model_current_gene.fit(methylation_data_close_to_current_gene_train, gene_expression_data_chromosome1_current_gene_train)

        # Predicting the gene expression values based on the 'methylation_data_close_to_current_gene_test' by calling the 
        # function 'predict()'
        gene_expression_data_chromosome1_current_gene_predict = model_current_gene.predict(methylation_data_close_to_current_gene_test)

        # Calculating the R2 score by calling the function 'r2_score()' with the actual values 
        # 'gene_expression_data_chromosome1_current_gene_test' and the predicted values 
        # 'gene_expression_data_chromosome1_current_gene_predict'.
        R2_score = r2_score(gene_expression_data_chromosome1_current_gene_test, gene_expression_data_chromosome1_current_gene_predict)

        # Adding the 'R2_score' value to the 'R2_scores_current_gene' list by calling the function 'append()'.
        R2_scores_current_gene.append(R2_score)

    return np.mean(R2_scores_current_gene)


# This function retrieves the R2 scores for the linear regression models (one for each gene) fitted to predict the   
# normalized 'gene_expression_data' based on the 'methylation_data'.
def linear_regression_for_normalized_datasets_combinations(methylation_data, gene_expression_data_split1, gene_expression_data_split2, gene_expression_data_split3, gene_expression_data_split4):
    
    # Retrieving the genes which are present on chromosome 1 by calling the function 
    # 'get_gene_expression_data_from_chromosome()' present within the notebook 'Machine Learning Additional Functions.ipynb'.
    # The DataFrame of the first split is used here because only the gene (column) names are needed, and these are the 
    # same for all four splits.
    gene_expression_data_chromosome1 = get_gene_expression_data_from_chromosome(gene_expression_data_split1, chromosome_experiments)
    genes = gene_expression_data_chromosome1.columns[1:]
    
    # Defining a list where all the R2 scores (one for each gene) will be stored such that we can later represent these
    # within a box plot to compare them with the R2 scores for the other experiments. This can be achieved by calling the
    # function 'calculate_R2_scores_for_normalized_gene_expression()' for each of the genes. Since the computations for a 
    # single gene do not influence the computations of any other gene, we can parallelize the execution of this function 
    # by calling the function 'Parallel()' from the 'joblib' library.
    R2_scores = Parallel(n_jobs=270)(delayed(calculate_R2_scores_for_normalized_gene_expression)(gene, methylation_data, gene_expression_data_split1, gene_expression_data_split2, gene_expression_data_split3, gene_expression_data_split4) for gene in genes)
    
    return R2_scores
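
The `if`/`elif` chain that picks the normalized gene expression data for each split could also be written as a dictionary lookup. This is only an alternative sketch: the column names 'Split 1' to 'Split 4' are taken from the function above, while the string placeholders stand in for the four DataFrames.

```python
# Placeholders standing in for the four per-split DataFrames.
gene_expression_data_by_split = {
    "Split 1": "data normalized for split 1",
    "Split 2": "data normalized for split 2",
    "Split 3": "data normalized for split 3",
    "Split 4": "data normalized for split 4",
}

for split in ["Split 1", "Split 2", "Split 3", "Split 4"]:
    # A KeyError is raised for an unexpected column name, whereas an
    # if/elif chain without an else branch would leave the variable
    # unchanged (or undefined on the first iteration).
    gene_expression_data = gene_expression_data_by_split[split]
    print(split, "->", gene_expression_data)
```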

### The regular methylation data & regular gene expression data

In [21]:
# Retrieving the R2 scores for the linear regression models (one for each gene) fitted to predict the gene expression 
# values based on the methylation data by calling the function 'linear_regression_for_datasets_combinations()' with as 
# arguments the 'methylation_data' and the 'gene_expression_data' DataFrames.
R2_scores_regular_methylation_regular_gene_expression = linear_regression_for_datasets_combinations(methylation_data, gene_expression_data)

NameError: name 'linear_regression_for_datasets_combinations' is not defined

As mentioned at the beginning of this section, the distributions of the R<sup>2</sup> scores for all the combinations of datasets will be visualized within a box plot in the section 'Analyzing the Dataset Combinations Results'. What we can already do, however, is print some basic statistics for the list of R<sup>2</sup> scores computed above, which is done in the code below.

In [None]:
# Retrieving some basic statistics for the list of R2 scores computed above by calling the functions 'min()', 'max()', 
# 'np.mean()', 'np.median()', 'np.var()', and 'np.std()'.
print("Minimum Value: ", min(R2_scores_regular_methylation_regular_gene_expression))
print("Maximum Value: ", max(R2_scores_regular_methylation_regular_gene_expression))
print("Mean: ", np.mean(R2_scores_regular_methylation_regular_gene_expression))
print("Median: ", np.median(R2_scores_regular_methylation_regular_gene_expression))
print("Variance: ", np.var(R2_scores_regular_methylation_regular_gene_expression))
print("Standard Deviation: ", np.std(R2_scores_regular_methylation_regular_gene_expression))
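
Because the same six statistics are printed for every dataset combination, they could be wrapped in a small helper. The sketch below uses Python's standard `statistics` module instead of NumPy, and the name `summarize_r2_scores` is hypothetical. `pvariance` and `pstdev` are the population versions, which correspond to the default behaviour (`ddof=0`) of `np.var` and `np.std` used in the cell above.

```python
import statistics


def summarize_r2_scores(r2_scores):
    """Return the basic statistics of a list of R2 scores as a dict."""
    return {
        "Minimum Value": min(r2_scores),
        "Maximum Value": max(r2_scores),
        "Mean": statistics.fmean(r2_scores),
        "Median": statistics.median(r2_scores),
        "Variance": statistics.pvariance(r2_scores),
        "Standard Deviation": statistics.pstdev(r2_scores),
    }


# Made-up scores, purely for illustration.
example_scores = [-1.0, 0.0, 0.5, 0.5]
for name, value in summarize_r2_scores(example_scores).items():
    print(f"{name}: {value}")
```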

### The regular methylation data & log2-transformed gene expression data

In [None]:
# Retrieving the R2 scores for the linear regression models (one for each gene) fitted to predict the gene expression 
# values based on the methylation data by calling the function 'linear_regression_for_datasets_combinations()' with as 
# arguments the 'methylation_data' and the 'gene_expression_data_log2_transformed' DataFrames.
R2_scores_regular_methylation_log2_transformed_gene_expression = linear_regression_for_datasets_combinations(methylation_data, gene_expression_data_log2_transformed)

As mentioned at the beginning of this section, the distributions of the R<sup>2</sup> scores for all the combinations of datasets will be visualized within a box plot in the section 'Analyzing the Dataset Combinations Results'. What we can already do, however, is print some basic statistics for the list of R<sup>2</sup> scores computed above, which is done in the code below.

In [None]:
# Retrieving some basic statistics for the list of R2 scores computed above by calling the functions 'min()', 'max()', 
# 'np.mean()', 'np.median()', 'np.var()', and 'np.std()'.
print("Minimum Value: ", min(R2_scores_regular_methylation_log2_transformed_gene_expression))
print("Maximum Value: ", max(R2_scores_regular_methylation_log2_transformed_gene_expression))
print("Mean: ", np.mean(R2_scores_regular_methylation_log2_transformed_gene_expression))
print("Median: ", np.median(R2_scores_regular_methylation_log2_transformed_gene_expression))
print("Variance: ", np.var(R2_scores_regular_methylation_log2_transformed_gene_expression))
print("Standard Deviation: ", np.std(R2_scores_regular_methylation_log2_transformed_gene_expression))

### The regular methylation data & normalized regular gene expression data

In [None]:
# Retrieving the R2 scores for the linear regression models (one for each gene) fitted to predict the gene expression 
# values based on the methylation data by calling the function 'linear_regression_for_normalized_datasets_combinations()' 
# with as arguments the 'methylation_data' DataFrame and the gene expression 'gene_expression_data_normalized_split1', 
# 'gene_expression_data_normalized_split2', 'gene_expression_data_normalized_split3', and 'gene_expression_data_normalized_split4'
# DataFrames.
R2_scores_regular_methylation_normalized_regular_gene_expression = linear_regression_for_normalized_datasets_combinations(methylation_data, gene_expression_data_normalized_split1, gene_expression_data_normalized_split2, gene_expression_data_normalized_split3, gene_expression_data_normalized_split4)

As mentioned at the beginning of this section, the distributions of the R<sup>2</sup> scores for all the dataset combinations are visualized in a box plot in the section 'Analyzing the Dataset Combinations Results'. In the meantime, the code below prints some basic statistics for the list of R<sup>2</sup> scores computed above.

In [None]:
# Retrieving some basic statistics for the list of R2 scores computed above by calling the functions 'min()', 'max()', 
# 'np.mean()', 'np.median()', 'np.var()', and 'np.std()'.
print("Minimum Value: ", min(R2_scores_regular_methylation_normalized_regular_gene_expression))
print("Maximum Value: ", max(R2_scores_regular_methylation_normalized_regular_gene_expression))
print("Mean: ", np.mean(R2_scores_regular_methylation_normalized_regular_gene_expression))
print("Median: ", np.median(R2_scores_regular_methylation_normalized_regular_gene_expression))
print("Variance: ", np.var(R2_scores_regular_methylation_normalized_regular_gene_expression))
print("Standard Deviation: ", np.std(R2_scores_regular_methylation_normalized_regular_gene_expression))

### The regular methylation data & normalized log2-transformed gene expression data

In [None]:
# Retrieving the R2 scores for the linear regression models (one for each gene) fitted to predict the gene expression 
# values based on the methylation data by calling the function 'linear_regression_for_normalized_datasets_combinations()' 
# with as arguments the 'methylation_data' DataFrame and the gene expression 'gene_expression_data_log2_transformed_normalized_split1', 
# 'gene_expression_data_log2_transformed_normalized_split2', 'gene_expression_data_log2_transformed_normalized_split3', and 
# 'gene_expression_data_log2_transformed_normalized_split4' DataFrames.
R2_scores_regular_methylation_normalized_log2_transformed_gene_expression = linear_regression_for_normalized_datasets_combinations(methylation_data, gene_expression_data_log2_transformed_normalized_split1, gene_expression_data_log2_transformed_normalized_split2, gene_expression_data_log2_transformed_normalized_split3, gene_expression_data_log2_transformed_normalized_split4)

As mentioned at the beginning of this section, the distributions of the R<sup>2</sup> scores for all the dataset combinations are visualized in a box plot in the section 'Analyzing the Dataset Combinations Results'. In the meantime, the code below prints some basic statistics for the list of R<sup>2</sup> scores computed above.

In [None]:
# Retrieving some basic statistics for the list of R2 scores computed above by calling the functions 'min()', 'max()', 
# 'np.mean()', 'np.median()', 'np.var()', and 'np.std()'.
print("Minimum Value: ", min(R2_scores_regular_methylation_normalized_log2_transformed_gene_expression))
print("Maximum Value: ", max(R2_scores_regular_methylation_normalized_log2_transformed_gene_expression))
print("Mean: ", np.mean(R2_scores_regular_methylation_normalized_log2_transformed_gene_expression))
print("Median: ", np.median(R2_scores_regular_methylation_normalized_log2_transformed_gene_expression))
print("Variance: ", np.var(R2_scores_regular_methylation_normalized_log2_transformed_gene_expression))
print("Standard Deviation: ", np.std(R2_scores_regular_methylation_normalized_log2_transformed_gene_expression))

### The M-transformed methylation data & regular gene expression data

In [None]:
# Retrieving the R2 scores for the linear regression models (one for each gene) fitted to predict the gene expression 
# values based on the methylation data by calling the function 'linear_regression_for_datasets_combinations()' with as 
# arguments the 'methylation_data_M_transformed' and the 'gene_expression_data' DataFrames.
R2_scores_M_transformed_methylation_regular_gene_expression = linear_regression_for_datasets_combinations(methylation_data_M_transformed, gene_expression_data)

As mentioned at the beginning of this section, the distributions of the R<sup>2</sup> scores for all the dataset combinations are visualized in a box plot in the section 'Analyzing the Dataset Combinations Results'. In the meantime, the code below prints some basic statistics for the list of R<sup>2</sup> scores computed above.

In [None]:
# Retrieving some basic statistics for the list of R2 scores computed above by calling the functions 'min()', 'max()', 
# 'np.mean()', 'np.median()', 'np.var()', and 'np.std()'.
print("Minimum Value: ", min(R2_scores_M_transformed_methylation_regular_gene_expression))
print("Maximum Value: ", max(R2_scores_M_transformed_methylation_regular_gene_expression))
print("Mean: ", np.mean(R2_scores_M_transformed_methylation_regular_gene_expression))
print("Median: ", np.median(R2_scores_M_transformed_methylation_regular_gene_expression))
print("Variance: ", np.var(R2_scores_M_transformed_methylation_regular_gene_expression))
print("Standard Deviation: ", np.std(R2_scores_M_transformed_methylation_regular_gene_expression))

### The M-transformed methylation data & log2-transformed gene expression data

In [None]:
# Retrieving the R2 scores for the linear regression models (one for each gene) fitted to predict the gene expression 
# values based on the methylation data by calling the function 'linear_regression_for_datasets_combinations()' with as 
# arguments the 'methylation_data_M_transformed' and the 'gene_expression_data_log2_transformed' DataFrames.
R2_scores_M_transformed_methylation_log2_transformed_gene_expression = linear_regression_for_datasets_combinations(methylation_data_M_transformed, gene_expression_data_log2_transformed)

As mentioned at the beginning of this section, the distributions of the R<sup>2</sup> scores for all the dataset combinations are visualized in a box plot in the section 'Analyzing the Dataset Combinations Results'. In the meantime, the code below prints some basic statistics for the list of R<sup>2</sup> scores computed above.

In [None]:
# Retrieving some basic statistics for the list of R2 scores computed above by calling the functions 'min()', 'max()', 
# 'np.mean()', 'np.median()', 'np.var()', and 'np.std()'.
print("Minimum Value: ", min(R2_scores_M_transformed_methylation_log2_transformed_gene_expression))
print("Maximum Value: ", max(R2_scores_M_transformed_methylation_log2_transformed_gene_expression))
print("Mean: ", np.mean(R2_scores_M_transformed_methylation_log2_transformed_gene_expression))
print("Median: ", np.median(R2_scores_M_transformed_methylation_log2_transformed_gene_expression))
print("Variance: ", np.var(R2_scores_M_transformed_methylation_log2_transformed_gene_expression))
print("Standard Deviation: ", np.std(R2_scores_M_transformed_methylation_log2_transformed_gene_expression))

### The M-transformed methylation data & normalized regular gene expression data

In [None]:
# Retrieving the R2 scores for the linear regression models (one for each gene) fitted to predict the gene expression 
# values based on the methylation data by calling the function 'linear_regression_for_normalized_datasets_combinations()' 
# with as arguments the 'methylation_data_M_transformed' DataFrame and the gene expression 'gene_expression_data_normalized_split1', 
# 'gene_expression_data_normalized_split2', 'gene_expression_data_normalized_split3', and 'gene_expression_data_normalized_split4'
# DataFrames.
R2_scores_M_transformed_methylation_normalized_regular_gene_expression = linear_regression_for_normalized_datasets_combinations(methylation_data_M_transformed, gene_expression_data_normalized_split1, gene_expression_data_normalized_split2, gene_expression_data_normalized_split3, gene_expression_data_normalized_split4)

As mentioned at the beginning of this section, the distributions of the R<sup>2</sup> scores for all the dataset combinations are visualized in a box plot in the section 'Analyzing the Dataset Combinations Results'. In the meantime, the code below prints some basic statistics for the list of R<sup>2</sup> scores computed above.

In [None]:
# Retrieving some basic statistics for the list of R2 scores computed above by calling the functions 'min()', 'max()', 
# 'np.mean()', 'np.median()', 'np.var()', and 'np.std()'.
print("Minimum Value: ", min(R2_scores_M_transformed_methylation_normalized_regular_gene_expression))
print("Maximum Value: ", max(R2_scores_M_transformed_methylation_normalized_regular_gene_expression))
print("Mean: ", np.mean(R2_scores_M_transformed_methylation_normalized_regular_gene_expression))
print("Median: ", np.median(R2_scores_M_transformed_methylation_normalized_regular_gene_expression))
print("Variance: ", np.var(R2_scores_M_transformed_methylation_normalized_regular_gene_expression))
print("Standard Deviation: ", np.std(R2_scores_M_transformed_methylation_normalized_regular_gene_expression))

### The M-transformed methylation data & normalized log2-transformed gene expression data

In [None]:
# Retrieving the R2 scores for the linear regression models (one for each gene) fitted to predict the gene expression 
# values based on the methylation data by calling the function 'linear_regression_for_normalized_datasets_combinations()' 
# with as arguments the 'methylation_data_M_transformed' DataFrame and the gene expression 'gene_expression_data_log2_transformed_normalized_split1', 
# 'gene_expression_data_log2_transformed_normalized_split2', 'gene_expression_data_log2_transformed_normalized_split3', and 
# 'gene_expression_data_log2_transformed_normalized_split4' DataFrames.
R2_scores_M_transformed_methylation_normalized_log2_transformed_gene_expression = linear_regression_for_normalized_datasets_combinations(methylation_data_M_transformed, gene_expression_data_log2_transformed_normalized_split1, gene_expression_data_log2_transformed_normalized_split2, gene_expression_data_log2_transformed_normalized_split3, gene_expression_data_log2_transformed_normalized_split4)

As mentioned at the beginning of this section, the distributions of the R<sup>2</sup> scores for all the dataset combinations are visualized in a box plot in the section 'Analyzing the Dataset Combinations Results'. In the meantime, the code below prints some basic statistics for the list of R<sup>2</sup> scores computed above.

In [None]:
# Retrieving some basic statistics for the list of R2 scores computed above by calling the functions 'min()', 'max()', 
# 'np.mean()', 'np.median()', 'np.var()', and 'np.std()'.
print("Minimum Value: ", min(R2_scores_M_transformed_methylation_normalized_log2_transformed_gene_expression))
print("Maximum Value: ", max(R2_scores_M_transformed_methylation_normalized_log2_transformed_gene_expression))
print("Mean: ", np.mean(R2_scores_M_transformed_methylation_normalized_log2_transformed_gene_expression))
print("Median: ", np.median(R2_scores_M_transformed_methylation_normalized_log2_transformed_gene_expression))
print("Variance: ", np.var(R2_scores_M_transformed_methylation_normalized_log2_transformed_gene_expression))
print("Standard Deviation: ", np.std(R2_scores_M_transformed_methylation_normalized_log2_transformed_gene_expression))

## Analyzing the Dataset Combinations Results

Next, we define a DataFrame that collects the lists of R<sup>2</sup> scores for all 8 dataset combinations, so that the boxplots (one for each dataset combination) can easily be created afterwards.

In [None]:
# Creating a dictionary containing the lists of R2 scores for all the 8 dataset combinations. 
R2_data = {
    'RM & RG': R2_scores_regular_methylation_regular_gene_expression,
    'RM & TG': R2_scores_regular_methylation_log2_transformed_gene_expression,
    'RM & NRG': R2_scores_regular_methylation_normalized_regular_gene_expression,
    'RM & NTG': R2_scores_regular_methylation_normalized_log2_transformed_gene_expression,
    'TM & RG': R2_scores_M_transformed_methylation_regular_gene_expression,
    'TM & TG': R2_scores_M_transformed_methylation_log2_transformed_gene_expression,
    'TM & NRG': R2_scores_M_transformed_methylation_normalized_regular_gene_expression,
    'TM & NTG': R2_scores_M_transformed_methylation_normalized_log2_transformed_gene_expression,
}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the data to be converted to the 
# DataFrame called 'R2_data'.
R2_df = pd.DataFrame(R2_data)

# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_CpG + "/R2_scores_dataset_combinations.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

print("The filled in 'R2_df' DataFrame:")
R2_df

The column names use abbreviations; the overview below gives the meaning of each letter:
* 'M': Methylation
* 'G': Gene Expression
* 'R': Regular
* 'T': Transformed (M-transformed for methylation data and log2-transformed for gene expression data)
* 'N': Normalized

Using the 'R2_df' DataFrame to which all of the lists have been added, we can now create the boxplots (one for each dataset combination) by calling the function 'boxplot()' from the 'Seaborn' library. We can also save this plot to the directory 'data_directory_results_CpG' by calling the function 'savefig()'.

Although the name of the metric R<sup>2</sup> may suggest that its value is always positive, R<sup>2</sup> can also be negative because a model can be arbitrarily worse than a constant model that always predicts the mean of the observed values, which yields an R<sup>2</sup> value of 0 [4].
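To make the sign of R<sup>2</sup> concrete, the minimal sketch below computes the score by hand, assuming the usual definition R<sup>2</sup> = 1 - SS<sub>res</sub> / SS<sub>tot</sub>. The function name `r_squared` and the numbers are made up for illustration and are not part of the analysis in this notebook.

```python
def r_squared(y_true, y_pred):
    """Compute R2 = 1 - SS_res / SS_tot for two equally long sequences."""
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared residuals between the observed and predicted values.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the observed values.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up observed values, used for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]

# A constant model that always predicts the mean of the observed values.
mean_prediction = [2.5] * 4

# A model whose predictions are worse than predicting the mean.
reversed_prediction = [4.0, 3.0, 2.0, 1.0]

print(r_squared(y_true, mean_prediction))      # 0.0
print(r_squared(y_true, reversed_prediction))  # -3.0
```

The second model has a sum of squared residuals that is four times the total sum of squares, which is why its score drops to -3.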

In [22]:
plt.figure(figsize=(20, 12))

# Creating a boxplot for every column (dataset combination) in the 'R2_df' DataFrame, plotting them on the same axis, 
# without showing the outliers.
ax = sns.boxplot(data=R2_df, showfliers=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=24)
ax.tick_params(axis='y', labelsize=24)

# Adding a title and the axis labels to the plot.
ax.set_title('The Distributions of the R2 scores of the Dataset Combinations', pad=20, fontsize=30)
ax.set_xlabel('Dataset Combination', labelpad=20, fontsize=30)
ax.set_ylabel('R2 score', labelpad=20, fontsize=30)

# Saving the plot by calling the function 'savefig()'.
file_to_save = data_directory_results_CpG + "/R2_scores_dataset_combinations.png"
plt.savefig(file_to_save, bbox_inches='tight')

# Show the plot
plt.show()

In addition, we can display how many of the R<sup>2</sup> scores are positive for each of the dataset combinations. Calling the function 'gt(0)' marks every score greater than 0 as True, and calling the function 'sum()' afterwards counts these values per column.

In [1]:
# Retrieving the number of positive R2 scores.
number_of_positive_R2_scores = R2_df.gt(0).sum()

print("The number of positive R2 scores:")
number_of_positive_R2_scores

The first thing we can notice from the boxplots above is that all the dataset combinations have a considerable number of negative R<sup>2</sup> scores. As mentioned before, this can happen and means that a model performs worse than simply predicting the mean. However, we are not assessing the performance of the individual linear regression models; rather, we want to conclude which of the dataset combinations performs best by comparing the distributions of their R<sup>2</sup> scores.

What we can conclude is that the dataset combinations featuring regular gene expression data (either normalized or not normalized) generally perform the worst among the 8 dataset combinations experimented with. The best performing dataset combinations seem to be the ones featuring the log2-transformed gene expression data (either normalized or not normalized). The distributions of these four dataset combinations, 'The regular methylation data & log2-transformed gene expression data', 'The M-transformed methylation data & log2-transformed gene expression data', 'The regular methylation data & normalized log2-transformed gene expression data', and 'The M-transformed methylation data & normalized log2-transformed gene expression data', look fairly similar to each other, so choosing one over the other would not make a big difference in performance for the next steps in applying the machine learning algorithms. I have therefore based my decision on which distribution has the highest median. This is the dataset combination 'The M-transformed methylation data & log2-transformed gene expression data', which will be used throughout the remainder of both the Distance Analysis and the CpG Site Analysis part.
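The selection by highest median can be sketched in a few lines. The example below is a hedged illustration only: the dictionary `example_R2_data` mirrors the structure of 'R2_data' but holds made-up scores for three hypothetical dataset combinations, and it uses Python's standard `statistics` module instead of pandas.

```python
import statistics

# Made-up R2 scores for three hypothetical dataset combinations.
example_R2_data = {
    'RM & RG': [-0.5, -0.25, 0.0],
    'RM & TG': [-0.25, 0.25, 0.5],
    'TM & TG': [-0.25, 0.5, 0.5],
}

# Median R2 score per dataset combination.
medians = {
    name: statistics.median(scores)
    for name, scores in example_R2_data.items()
}

# The dataset combination whose distribution has the highest median.
best_combination = max(medians, key=medians.get)

print(medians)
print(best_combination)
```

With the 'R2_df' DataFrame, the same comparison could presumably be made with `R2_df.median().idxmax()`, which returns the column label with the highest median.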