# Ridge Regression for Distance Analysis
### Laurence Nickel (i6257119)

Libraries used: 
* pandas (version: '1.2.4')
* re (version: '2.2.1')
* sys (version: '3.8.8')
* os (version: '3.8.8')
* time (version: '3.8.8')
* plotly.express (version: '5.13.1')
* seaborn (version: '0.11.1')
* matplotlib.pyplot (version: '3.3.4')
* sklearn.linear_model (version: '0.24.1')
* sklearn.metrics (version: '0.24.1')
* joblib (version: '1.0.1')

References:
* [1] McDonald, G. C. (2009). Ridge regression. *Wiley Interdisciplinary Reviews: Computational Statistics, 1*(1), 93–100. doi: https://doi.org/10.1002/wics.14.
* [2] Hong, J., \& Rhee, J. (2022). Genomic Effect of DNA Methylation on Gene Expression in Colorectal Cancer. *Biology (Basel), 11*(10): 1388. doi: 10.3390/biology11101388.
* [3] Miles, J. (2005). "R-squared, Adjusted R-squared," in *Encyclopedia of Statistics in Behavioral Science - Volume 4*, eds B. S. Everitt \& D. C. Howell (Hoboken, NJ, USA: John Wiley \& Sons), 1655-1657. doi: https://doi.org/10.1002/0470013192.bsa526.

## Introduction

In this notebook, ridge regression is used to predict the expression level of each gene from the methylation values of CpG sites. Several distances are tested, where a distance determines which CpG sites are used for a gene based on their location relative to that gene, with the goal of finding the distance that yields the best predictions of the gene expression values. The dataset combination used is the M-transformed methylation data together with the log2-transformed gene expression data, which was determined to be the best performing combination in the notebook 'Linear Regression for Testing the Datasets.ipynb'.

Ridge regression is a linear regression method that uses a regularization term to prevent overfitting and improve the model's generalization [1]. Just like linear regression, ridge regression assumes a linear relationship between the independent variables and the dependent variable, but by including polynomial terms as features it can also model non-linear relationships. As in linear regression, a linear equation is fitted that relates a dependent variable to one or more independent variables. This is presented in Equation 1, where $Y$ represents the gene expression value, $\beta_0$ is the intercept (bias term), $\beta_1, \beta_2, \ldots, \beta_n$ are the coefficients corresponding to each CpG site, and $\epsilon$ denotes the error term.

<br></br>
\begin{equation}
Y = \beta_0 + \beta_1 \cdot CpG_1 + \beta_2 \cdot CpG_2 + \ldots + \beta_n \cdot CpG_n + \epsilon\tag{1}
\end{equation}
<br></br>

In linear regression, the Ordinary Least Squares (OLS) method can be used to estimate the coefficients $\beta_0, \beta_1, \ldots, \beta_n$. The OLS estimator minimizes the sum of squared differences between the observed and predicted gene expression values. Ridge regression adds a regularization term to this OLS objective. This is presented in Equation 2, where $\hat{\beta}$ represents the estimated coefficient values that minimize the penalized sum of squared errors and $N$ the number of samples.

<br></br>
\begin{equation}
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^N (Y_i - \beta_0 - \beta_1 \cdot CpG_{1i} - \beta_2 \cdot CpG_{2i} - \ldots - \beta_n \cdot CpG_{ni})^2 + \lambda \sum_{j=1}^{n} \beta_j^2\tag{2}
\end{equation}
<br></br>

To obtain the coefficients $\beta_0, \beta_1, \ldots, \beta_n$, the optimization problem in Equation 2 is solved. Here, $n$ represents the number of CpG sites, and the regularization parameter $\lambda$ controls the regularization strength. The term $\lambda \sum_{j=1}^{n} \beta_j^2$ penalizes the squared values of the coefficients, which shrinks the coefficients towards zero without setting them exactly to zero. By adjusting the value of $\lambda$, ridge regression balances minimizing the error against reducing the impact of less influential CpG sites. This results in a more robust and stable model.
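To make the effect of the penalty in Equation 2 concrete, the following minimal sketch solves the ridge problem in closed form on synthetic data. The data, the helper name `ridge_closed_form`, and the chosen values of $\lambda$ are illustrative assumptions and not part of the thesis pipeline. The intercept is left unpenalized by centering the data first, in line with Equation 2 where the penalty sum starts at $j = 1$.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """Solve Equation 2 in closed form (hypothetical helper, for illustration only)."""
    # Centering X and y removes the intercept from the problem, so beta_0 is not penalized.
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    # beta = (Xc^T Xc + lam * I)^(-1) Xc^T yc
    n_features = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    # Recovering the intercept from the means.
    beta_0 = y_mean - X_mean @ beta
    return beta_0, beta


# Synthetic stand-in for methylation values (60 samples, 5 CpG sites); not real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 2.0 + X @ np.array([1.5, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=60)

lambdas = [0.01, 1.0, 10.0, 100.0]
coefficient_norms = []
for lam in lambdas:
    _, beta = ridge_closed_form(X, y, lam)
    coefficient_norms.append(float(np.linalg.norm(beta)))
    print(f"lambda = {lam:>6}: ||beta|| = {coefficient_norms[-1]:.3f}")
```

The norm of the coefficient vector decreases as $\lambda$ grows, while the individual coefficients generally remain non-zero.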

Regarding the suitability of ridge regression for this thesis: it has previously been applied successfully to predict gene expression values from methylation data [2], which shows that it is suitable for this kind of data. Note that this does not mean that ridge regression is necessarily the best performing regression method for this task, but it can be expected to provide reasonable results for our purpose of finding the distance with which the dataset combination performs best.

There are multiple distances (40 in total) that we will experiment with throughout this notebook. These are presented in the overview below:
* 5,000
* 10,000
* 15,000
* 25,000
* 50,000
* 75,000
* 100,000
* 150,000
* 250,000
* 350,000
* 500,000
* 750,000
* 1,000,000
* 1,500,000
* 2,000,000
* 2,500,000
* 4,000,000
* 5,000,000
* 6,000,000
* 7,500,000
* 10,000,000
* 12,500,000
* 15,000,000
* 17,500,000
* 20,000,000
* 25,000,000
* 30,000,000
* 40,000,000
* 50,000,000
* 65,000,000
* 80,000,000
* 100,000,000
* 120,000,000
* 150,000,000
* 200,000,000
* 250,000,000
* 350,000,000
* 500,000,000
* 750,000,000
* 1,000,000,000

For each of these distances ridge regression models are built, one for each gene, and these are evaluated. The distance of the best performing experiment will represent which distance we should use to determine which CpG sites to consider (based on their position relative to a particular gene) for predicting the gene expression value of a particular gene.

In the notebook 'Linear Regression for Testing the Datasets.ipynb', models were only built for the genes located on chromosome 1 to reduce the computational burden. Here, we instead use all of the genes present in the log2-transformed gene expression data, since this notebook (together with the other machine learning algorithm notebooks in this directory) contains the main experiments of the Distance Analysis part.

To measure the prediction accuracy for the gene expression values, the R-squared (R<sup>2</sup>) metric is computed for each model. It indicates the proportion of the variance in the dependent variable that is explained by the model [3], so higher values indicate a better fit, with 1 being the largest possible value. The R<sup>2</sup> value for a gene is obtained by applying 4-fold cross-validation and averaging the R<sup>2</sup> values of the four folds. The training and test splits are defined in the notebook 'Training and Test Set Division.ipynb' in the 'Machine Learning Algorithms - Preprocessing' folder, which also motivates the choice of k = 4 for the k-fold cross-validation.
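As a small illustration of the metric, the sketch below computes R<sup>2</sup> by hand and averages four per-fold scores. The helper `r_squared` is a hypothetical stand-in that follows the same definition as `sklearn.metrics.r2_score`, and all numbers (including the per-fold scores) are made up for the example.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (illustrative helper, not the notebook's code)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

# A perfect prediction reaches the maximum of 1.
perfect = r_squared(y_true, y_true)
# Always predicting the mean of the observed values gives 0.
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])
# A model that is worse than predicting the mean gives a negative value.
worse_than_mean = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])

# Made-up R^2 scores of one gene for the four folds; the reported score is their mean.
fold_scores = [0.42, 0.35, 0.51, 0.40]
mean_r2 = float(np.mean(fold_scores))

print(perfect, mean_only, worse_than_mean, mean_r2)
```

Note that although 1 is the largest possible value, R<sup>2</sup> computed on a test set has no lower bound of 0: it becomes negative whenever the model predicts worse than the mean of the observed values.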

### Importing libraries

Before we can start to define all the functions, we should first import some libraries that will be used throughout this notebook.

In [3]:
print("Starting the importing of the libraries...")


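import pandas as pd
# 'np.mean()' is used in the function 'calculate_R2_scores()' later in this notebook.
import numpy as np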
import re
import sys
import os
import time

# Here we first need to install the plotly library.
!pip install plotly
import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

!pip install joblib
import joblib
from joblib import Parallel, delayed


print("Finished importing the libraries.")

Starting the importing of the libraries...



Now that all the libraries have been imported, we can verify that they have been loaded into this notebook by printing the version attribute of each library.

In [4]:
# Retrieving the version of the libraries to verify they have been correctly loaded into this notebook.
print("The library 'pd' (pandas) has been loaded into the notebook with its version being:")
print(pd.__version__)

print("\nThe library 're' has been loaded into the notebook with its version being:")
print(re.__version__)

print("\nThe library 'sys' has been loaded into the notebook with its version being:")
print(sys.version)

print("\nThe library 'plotly' has been loaded into the notebook with its version being:")
print(plotly.__version__)

print("\nThe library 'sns' (seaborn) has been loaded into the notebook with its version being:")
print(sns.__version__)

print("\nThe library 'matplotlib' has been loaded into the notebook with its version being:")
print(matplotlib.__version__)

print("\nThe library 'sklearn' has been loaded into the notebook with its version being:")
print(sklearn.__version__)

print("\nThe library 'joblib' has been loaded into the notebook with its version being:")
print(joblib.__version__)

The library 'pd' (pandas) has been loaded into the notebook with its version being:
1.2.4

The library 're' has been loaded into the notebook with its version being:
2.2.1

The library 'sys' has been loaded into the notebook with its version being:
3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]

The library 'plotly' has been loaded into the notebook with its version being:
5.13.1

The library 'sns' (seaborn) has been loaded into the notebook with its version being:
0.11.1

The library 'matplotlib' has been loaded into the notebook with its version being:
3.3.4

The library 'sklearn' has been loaded into the notebook with its version being:
0.24.1

The library 'joblib' has been loaded into the notebook with its version being:
1.0.1


### Defining the data directories

In addition, we need to define the data directories from which the gene expression and methylation data files and the training and test split data files are loaded, and to which the results are saved. Note that these paths need to be changed to the corresponding directories on your own machine.

In [5]:
data_directory_final_datasets = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets"
data_directory_final_datasets_distance = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/Distance Analysis"
data_directory_training_and_test_splits = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/training_and_test_splits"
data_directory_results_distance = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/results/Distance Analysis/Ridge Regression"
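One way to make these paths easier to adapt is to derive them from a single base directory. The following is a minimal sketch that assumes the same folder layout as in the cell above; `base_directory` and its value are hypothetical placeholders.

```python
from pathlib import Path

# Hypothetical base directory; replace it with the location of the thesis data.
base_directory = Path("path/to/Bachelor Thesis Data")

final_datasets_path = base_directory / "final_datasets"

# 'as_posix()' returns forward-slash strings, so string concatenation such as
# data_directory + '/file.csv', as used elsewhere in this notebook, keeps working.
data_directory_final_datasets = final_datasets_path.as_posix()
data_directory_final_datasets_distance = (final_datasets_path / "Distance Analysis").as_posix()
data_directory_training_and_test_splits = (base_directory / "training_and_test_splits").as_posix()
data_directory_results_distance = (base_directory / "results" / "Distance Analysis" / "Ridge Regression").as_posix()

print(data_directory_results_distance)
```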

## Loading Training and Test Split Data

In this section, the training and test split data files are loaded from the directory 'data_directory_training_and_test_splits' into this notebook by calling the function 'pd.read_csv()' with the path of the file to be read as its argument.

#### Loading the 'fold_assignments_samples.csv' file into this notebook

In [6]:
# Loading the file 'fold_assignments_samples.csv'.
fold_assignments = pd.read_csv(data_directory_training_and_test_splits + '/fold_assignments_samples.csv')

print("The 'fold_assignments' DataFrame:")
fold_assignments

The 'fold_assignments' DataFrame:


| | Samples | Fold |
|---|---|---|
| 0 | TCGA-06-0125-01A-01 | 1 |
| 1 | TCGA-06-0125-02A-11 | 1 |
| 2 | TCGA-06-0152-02A-01 | 2 |
| 3 | TCGA-06-0171-02A-11 | 1 |
| 4 | TCGA-06-0190-01A-01 | 4 |
| ... | ... | ... |
| 59 | TCGA-76-4927-01A-01 | 1 |
| 60 | TCGA-76-4928-01B-01 | 3 |
| 61 | TCGA-76-4929-01A-01 | 2 |
| 62 | TCGA-76-4931-01A-01 | 2 |


#### Loading the 'training_and_test_assignments_samples.csv' file into this notebook

In [7]:
# Loading the file 'training_and_test_assignments_samples.csv'.
training_and_test_assignments = pd.read_csv(data_directory_training_and_test_splits + '/training_and_test_assignments_samples.csv')

print("The 'training_and_test_assignments' DataFrame:")
training_and_test_assignments

The 'training_and_test_assignments' DataFrame:


| | Samples | Split 1 | Split 2 | Split 3 | Split 4 |
|---|---|---|---|---|---|
| 0 | TCGA-06-0125-01A-01 | TEST | TRAIN | TRAIN | TRAIN |
| 1 | TCGA-06-0125-02A-11 | TEST | TRAIN | TRAIN | TRAIN |
| 2 | TCGA-06-0152-02A-01 | TRAIN | TEST | TRAIN | TRAIN |
| 3 | TCGA-06-0171-02A-11 | TEST | TRAIN | TRAIN | TRAIN |
| 4 | TCGA-06-0190-01A-01 | TRAIN | TRAIN | TRAIN | TEST |
| ... | ... | ... | ... | ... | ... |
| 59 | TCGA-76-4927-01A-01 | TEST | TRAIN | TRAIN | TRAIN |
| 60 | TCGA-76-4928-01B-01 | TRAIN | TRAIN | TEST | TRAIN |
| 61 | TCGA-76-4929-01A-01 | TRAIN | TEST | TRAIN | TRAIN |
| 62 | TCGA-76-4931-01A-01 | TRAIN | TEST | TRAIN | TRAIN |
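The rows displayed in the two DataFrames are consistent with a simple rule: a sample is part of the test set of split k when it is assigned to fold k, and part of the training set otherwise. The sketch below reproduces this rule for the first five samples shown above. It only illustrates the relationship between the two tables; the actual derivation is done in the notebook 'Training and Test Set Division.ipynb' and may differ in its details. The names `fold_assignments_example` and `splits_example` are hypothetical.

```python
import pandas as pd

# The first five rows of the 'fold_assignments' DataFrame shown above.
fold_assignments_example = pd.DataFrame({
    "Samples": [
        "TCGA-06-0125-01A-01",
        "TCGA-06-0125-02A-11",
        "TCGA-06-0152-02A-01",
        "TCGA-06-0171-02A-11",
        "TCGA-06-0190-01A-01",
    ],
    "Fold": [1, 1, 2, 1, 4],
})

# Building one 'Split k' column per fold: TEST if the sample is in fold k, TRAIN otherwise.
splits_example = fold_assignments_example[["Samples"]].copy()
for fold in range(1, 5):
    is_test = fold_assignments_example["Fold"] == fold
    splits_example[f"Split {fold}"] = is_test.map({True: "TEST", False: "TRAIN"})

print(splits_example)
```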


## Loading all the Different Datasets

Within this section, the datasets present within the best performing dataset combination are loaded into this notebook:
* The M-transformed methylation data file
* The log2-transformed gene expression data file

These are present in the directories 'data_directory_final_datasets' and 'data_directory_final_datasets_distance'. Each file is loaded by calling the function 'pd.read_csv()' with the path of the file to be read as its argument.

#### Loading the 'methylation_data_M_transformed_final.csv' file into this notebook

In [8]:
# Loading the file 'methylation_data_M_transformed_final.csv'.
methylation_data_M_transformed = pd.read_csv(data_directory_final_datasets + '/methylation_data_M_transformed_final.csv')

print("The 'methylation_data_M_transformed' DataFrame:")
methylation_data_M_transformed

The 'methylation_data_M_transformed' DataFrame:


| | Samples | cg00000957 | cg00001349 | cg00001583 | cg00002837 | cg00003287 | cg00004121 | cg00008647 | cg00009292 | cg00011717 | ... | ch.22.28920330F | ch.22.436090R | ch.22.441164F | ch.22.528917R | ch.22.569473R | ch.22.707049R | ch.22.728807R | ch.22.734399R | ch.22.772318F | ch.22.909671F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCGA-06-0125-01A-01 | 3.132755 | 3.960518 | -5.452737 | 2.348422 | 0.642046 | 0.425173 | -3.568310 | 0.090094 | 4.677447 | ... | -4.328807 | -1.604700 | -5.391144 | -4.299671 | -3.299155 | -5.006189 | -4.483695 | -1.512707 | -4.935511 | -4.526937 |
| 1 | TCGA-06-0125-02A-11 | 3.196057 | 3.825019 | -5.503606 | 1.372434 | 0.849407 | 0.629880 | -3.292764 | 1.242929 | 5.700119 | ... | -2.854379 | -1.720475 | -5.169584 | -4.430285 | -3.100187 | -4.953307 | -3.404266 | -1.763778 | -4.931599 | -3.668569 |
| 2 | TCGA-06-0152-02A-01 | 4.057813 | 3.626717 | 1.146710 | -0.208610 | 0.059986 | 0.788350 | -0.941862 | 1.416831 | 5.731835 | ... | -4.614915 | -2.221432 | -5.487836 | -4.947837 | -2.324764 | -4.632719 | -4.005875 | -1.983516 | -4.787276 | -3.809136 |
| 3 | TCGA-06-0171-02A-11 | 4.139295 | 3.058785 | -1.109405 | -0.179801 | -2.005682 | 0.634832 | -4.202176 | -0.933237 | 6.317184 | ... | -3.329587 | -3.007868 | -5.728251 | -3.993713 | -3.442227 | -4.986055 | -3.436608 | -2.831527 | -5.334004 | -4.291577 |
| 4 | TCGA-06-0190-01A-01 | 3.179215 | 3.408476 | -4.613496 | -0.233366 | -0.672902 | 0.370326 | -1.536587 | 0.712060 | 4.594485 | ... | -3.925038 | -2.666686 | -5.455511 | -4.393648 | -3.584784 | -5.394690 | -3.741805 | -2.292808 | -4.840999 | -3.959180 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59 | TCGA-76-4927-01A-01 | 3.427987 | 4.101014 | -0.268118 | -0.005975 | -0.769480 | 0.898056 | -5.020576 | 1.241059 | 5.889282 | ... | -4.408250 | -1.532935 | -3.325032 | -2.215352 | -3.453181 | -3.190405 | -3.143200 | -0.768262 | -2.643066 | -1.345063 |
| 60 | TCGA-76-4928-01B-01 | 2.091628 | 3.824918 | -5.990499 | 0.344366 | -0.717767 | 0.425782 | -5.487022 | 0.370565 | 5.714939 | ... | -5.370823 | -1.044793 | -4.456032 | -2.967873 | -3.687673 | -3.546754 | -3.881002 | -1.597628 | -4.209691 | -2.319970 |
| 61 | TCGA-76-4929-01A-01 | 3.166884 | 3.437609 | -6.054600 | 1.700659 | -4.722319 | 0.514961 | -5.351739 | 1.562834 | 3.495553 | ... | -4.206406 | -1.537983 | -3.914480 | -2.712960 | -3.349122 | -2.192265 | -2.538937 | -1.356304 | -2.419188 | -1.598524 |
| 62 | TCGA-76-4931-01A-01 | 2.464759 | 3.631037 | 3.468339 | 0.585791 | 0.666281 | 0.879107 | -5.656808 | 3.381448 | 4.041543 | ... | -4.784318 | -1.500444 | -4.241542 | -4.235881 | -3.584725 | -2.905236 | -3.847968 | -0.846516 | -4.245142 | -1.847410 |


#### Loading the 'gene_expression_data_log2_transformed_distance_correlation_genes_removed.csv' file into this notebook

In [9]:
# Loading the file 'gene_expression_data_log2_transformed_distance_correlation_genes_removed.csv'.
gene_expression_data_log2_transformed = pd.read_csv(data_directory_final_datasets_distance + '/gene_expression_data_log2_transformed_distance_correlation_genes_removed.csv')

print("The 'gene_expression_data_log2_transformed' DataFrame:")
gene_expression_data_log2_transformed

The 'gene_expression_data_log2_transformed' DataFrame:


| | Samples | ENSG00000001561 | ENSG00000001629 | ENSG00000001631 | ENSG00000002587 | ENSG00000002746 | ENSG00000004487 | ENSG00000004534 | ENSG00000004777 | ENSG00000005007 | ... | ENSG00000287064 | ENSG00000287151 | ENSG00000287263 | ENSG00000287562 | ENSG00000287828 | ENSG00000287893 | ENSG00000288156 | ENSG00000288586 | ENSG00000288612 | ENSG00000288658 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCGA-06-0125-01A-01 | 1.800910 | 6.291825 | 3.187578 | 4.739843 | 0.840443 | 6.443154 | 6.150584 | 5.753300 | 6.185981 | ... | 0.383055 | 0.998123 | 2.884598 | 2.785362 | 1.155102 | 0.662205 | 1.714795 | 1.547549 | 2.704385 | 0.767655 |
| 1 | TCGA-06-0125-02A-11 | 3.646969 | 6.455447 | 2.624943 | 3.701738 | 1.102591 | 5.885086 | 4.452912 | 4.893236 | 6.229463 | ... | 1.371670 | 0.805292 | 2.562279 | 1.262373 | 0.684101 | 1.156397 | 2.349592 | 1.382944 | 1.640852 | 0.700617 |
| 2 | TCGA-06-0152-02A-01 | 3.470029 | 6.264604 | 2.771886 | 2.792501 | 1.866354 | 6.251889 | 5.459441 | 5.540536 | 6.376085 | ... | 2.235666 | 1.427284 | 1.817582 | 1.778503 | 2.435389 | 0.268075 | 2.293047 | 1.481764 | 2.196733 | 1.697285 |
| 3 | TCGA-06-0171-02A-11 | 5.374226 | 5.630117 | 2.075122 | 2.951569 | 0.387693 | 4.875726 | 4.920541 | 3.527258 | 5.813684 | ... | 0.524866 | 1.275186 | 3.038612 | 1.359971 | 0.718526 | 0.000000 | 4.087887 | 0.931078 | 2.126444 | 0.766722 |
| 4 | TCGA-06-0190-01A-01 | 4.371350 | 5.433400 | 2.516519 | 2.700994 | 0.483777 | 6.042990 | 5.195911 | 5.074073 | 5.765248 | ... | 0.504570 | 1.549620 | 3.181452 | 1.369606 | 2.538563 | 0.603976 | 3.092427 | 0.928048 | 1.875387 | 0.229834 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59 | TCGA-76-4927-01A-01 | 3.829444 | 6.280969 | 2.682978 | 3.194339 | 1.531768 | 6.083558 | 5.055677 | 5.403540 | 6.111363 | ... | 1.157367 | 2.028640 | 3.703887 | 1.156526 | 1.974860 | 1.064469 | 2.707591 | 1.171783 | 1.187134 | 1.310689 |
| 60 | TCGA-76-4928-01B-01 | 3.613025 | 5.277862 | 2.277092 | 2.464747 | 0.854634 | 6.027249 | 4.900635 | 4.202339 | 5.355739 | ... | 0.431035 | 1.828022 | 2.732334 | 1.098824 | 1.462733 | 0.000000 | 2.435789 | 0.440952 | 1.622790 | 0.516923 |
| 61 | TCGA-76-4929-01A-01 | 4.978159 | 6.151981 | 2.740604 | 2.949591 | 0.632827 | 7.047989 | 5.192352 | 4.957706 | 5.554177 | ... | 0.276556 | 2.139044 | 4.254881 | 2.153546 | 1.410287 | 2.314174 | 0.731965 | 2.031395 | 1.956837 | 2.229649 |
| 62 | TCGA-76-4931-01A-01 | 3.653003 | 6.518490 | 3.018225 | 3.815545 | 1.211884 | 6.588232 | 6.448640 | 6.178647 | 6.011404 | ... | 1.364012 | 2.444163 | 2.508403 | 2.561839 | 2.329698 | 1.014141 | 1.270589 | 1.569199 | 3.349351 | 0.928579 |


## Ridge Regression

Within this section, ridge regression is applied to the dataset combination loaded above. This is done multiple times, once for each of the distances listed below:
* 5,000 
* 10,000
* 15,000
* 25,000
* 50,000
* 75,000
* 100,000
* 150,000
* 250,000
* 350,000
* 500,000
* 750,000
* 1,000,000
* 1,500,000
* 2,000,000
* 2,500,000
* 4,000,000
* 5,000,000
* 6,000,000
* 7,500,000
* 10,000,000
* 12,500,000
* 15,000,000
* 17,500,000
* 20,000,000
* 25,000,000
* 30,000,000
* 40,000,000
* 50,000,000
* 65,000,000
* 80,000,000
* 100,000,000
* 120,000,000
* 150,000,000
* 200,000,000
* 250,000,000
* 350,000,000
* 500,000,000
* 750,000,000
* 1,000,000,000

For each of these distances ridge regression models are built, one for each gene, and these are evaluated. The distance of the best performing experiment will represent which distance we should use to determine which CpG sites to consider (based on their position relative to a particular gene) for predicting the gene expression value of a particular gene.

As mentioned within the 'Introduction', we will use all of the genes present within the log2-transformed gene expression data, and not just the ones present on a single chromosome, since this notebook (among the other machine learning algorithms notebooks within this directory) represents the main experiments of the Distance Analysis part.

To measure the prediction accuracy for the gene expression values, the R-squared (R<sup>2</sup>) metric is computed for each model. It indicates the proportion of the variance in the dependent variable that is explained by the model [3], so higher values indicate a better fit, with 1 being the largest possible value. The R<sup>2</sup> value for a gene is obtained by applying 4-fold cross-validation and averaging the R<sup>2</sup> values of the four folds. The training and test splits are defined in the notebook 'Training and Test Set Division.ipynb' in the 'Machine Learning Algorithms - Preprocessing' folder, which also motivates the choice of k = 4 for the k-fold cross-validation. The resulting R<sup>2</sup> values for each of the genes for each of the distances listed above will be visualized within a single box plot in the notebook 'Analyzing Ridge Regression Results for Distance Analysis'.

First, we run the 'Machine Learning Additional Functions.ipynb' notebook in the folder 'Machine Learning Algorithms', which contains helper functions for the machine learning algorithms, such as retrieving the methylation data within a certain distance from a gene. A notebook can be run by calling the command '%run' with the notebook as its argument.

In [10]:
# Running the notebook 'Machine Learning Additional Functions.ipynb' by calling the command '%run'.
%run "../Machine Learning Additional Functions.ipynb"

Starting the importing of the libraries...
Finishing the installing of the libraries.
The library 'pd' (pandas) has been loaded into the notebook with its version being:
1.2.4
The library 'np' (numpy) has been loaded into the notebook with its version being:
1.20.1

The library 're' has been loaded into the notebook with its version being:
2.2.1

The library 'sys' has been loaded into the notebook with its version being:
3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
The 'CpG_sites_location_data' DataFrame containing the location data of the CpG sites:
The 'genes_location_data' DataFrame containing the location data of the genes:
The 'chromosomes_length_data' DataFrame containing the lengths of the chromosomes:
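The helper functions themselves are defined in the other notebook and are not reproduced here. To illustrate the general idea of selecting CpG sites by distance, the sketch below keeps the CpG sites on the same chromosome as a gene whose position lies within `distance` base pairs upstream of the gene start or downstream of the gene end. The function name `select_cpg_sites_within_distance`, the column names, the identifiers, and this exact definition of the window are all assumptions made for the example and may differ from the actual helper.

```python
import pandas as pd


def select_cpg_sites_within_distance(cpg_locations, gene_chromosome, gene_start, gene_end, distance):
    """Return the CpG identifiers within 'distance' of a gene (hypothetical helper)."""
    # Only CpG sites on the same chromosome as the gene are considered.
    same_chromosome = cpg_locations["chromosome"] == gene_chromosome
    # The window extends 'distance' base pairs in both directions from the gene (bounds inclusive).
    in_window = cpg_locations["position"].between(gene_start - distance, gene_end + distance)
    return cpg_locations.loc[same_chromosome & in_window, "cpg_id"].tolist()


# Made-up location data; the identifiers and positions are not real.
cpg_locations_example = pd.DataFrame({
    "cpg_id": ["cg_a", "cg_b", "cg_c", "cg_d"],
    "chromosome": ["1", "1", "1", "2"],
    "position": [9_000, 52_000, 400_000, 10_000],
})

# A made-up gene on chromosome 1 spanning positions 10,000 to 50,000.
selected = select_cpg_sites_within_distance(cpg_locations_example, "1", 10_000, 50_000, distance=5_000)
print(selected)
```

With a distance of 5,000, only 'cg_a' and 'cg_b' fall inside the window; 'cg_c' is too far away and 'cg_d' lies on another chromosome. Larger distances widen the window and therefore increase the number of CpG sites used as features.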


Unlike the linear regression algorithm, which has no additional parameters to consider, ridge regression has a parameter that multiplies the L2 term and thereby controls the regularization strength (the regularization parameter $\lambda$ presented in the introduction). This constant is commonly referred to as 'alpha', which is the terminology used throughout the remainder of this notebook. Alpha can take any non-negative value, although 0 is not recommended as the estimator then reduces to the OLS estimator. It is therefore important to first experiment with different values of alpha on our data to find the one that produces the highest R<sup>2</sup> scores. To reduce the computational burden of this experiment, models are only built for the genes located on chromosome 1, using the CpG sites within a distance of 5,000,000 in both directions from the gene. This yields a single R<sup>2</sup> score for each gene for a given alpha.

In [11]:
# Retrieving the gene expression data present in the 'gene_expression_data_log2_transformed' DataFrame of which the genes 
# are present on chromosome 1 by calling the function 'get_gene_expression_data_from_chromosome()' present within the 
# notebook 'Machine Learning Additional Functions.ipynb'.
gene_expression_data_log2_transformed_chromosome1 = get_gene_expression_data_from_chromosome(gene_expression_data_log2_transformed, 1)

# Defining the distance setting to be used for the experiment.
distance_experiment = 5000000

Before we perform the experiment, we should first define the function that will perform the ridge regression on the data for the experiment of finding the best alpha.

#### Defining the functions needed to calculate the R<sup>2</sup> scores for a given distance and alpha

Next, we can define the functions needed to calculate the R<sup>2</sup> scores for a given distance and alpha. The function 'calculate_R2_scores()' can also be used within the section 'Applying Ridge Regression' later in this notebook. We can utilize the 'joblib' library to parallelize our code as the computations for a single gene do not influence the computations of any other gene.

In [12]:
# This function calculates the R2 scores for a single 'gene' which is present in the 'gene_expression_data_log2_transformed'
# DataFrame.
def calculate_R2_scores(gene, distance, alpha): 
    
    # Retrieving the log2-transformed gene expression data of the current 'gene' and retrieving the M-transformed 
    # methylation data that is within a distance of 'distance' from this gene and for which each of the CpG sites has a 
    # correlation higher than the threshold of 0.3 with the gene (from the notebook 'Machine Learning Preliminary Analysis 
    # for Distance Analysis - Pearson's Correlation Coefficient.ipynb') by calling the function 
    # 'get_methylation_data_close_to_gene_and_with_higher_correlation_than_threshold()' which is present in the notebook 
    # 'Machine Learning Additional Functions.ipynb'.
    gene_expression_data_current_gene = gene_expression_data_log2_transformed[['Samples', gene]]
    methylation_data_close_to_current_gene = get_methylation_data_close_to_gene_and_with_higher_correlation_than_threshold(methylation_data_M_transformed, gene_expression_data_current_gene, distance=distance, threshold=0.30)

    # Defining a list where all the R2 scores for the current gene will be stored. Since one model is built per fold (split),
    # the list 'R2_scores_current_gene' will eventually contain 4 elements (4 R2 scores).
    R2_scores_current_gene = []
    
    # Looping over every column in the 'training_and_test_assignments' DataFrame such that 4-fold cross-validation is 
    # performed using the training and test sets defined within that 'training_and_test_assignments' DataFrame.
    for split in training_and_test_assignments.columns[1:]:

        # Retrieving the samples which belong to the training and test set for the current split 'split'.
        selected_samples_train = training_and_test_assignments.loc[training_and_test_assignments[split] == "TRAIN", 'Samples'].tolist()
        selected_samples_test = training_and_test_assignments.loc[training_and_test_assignments[split] == "TEST", 'Samples'].tolist()

        # Retrieving the gene expression and methylation data of which the samples belong to the training set.
        gene_expression_data_current_gene_train = gene_expression_data_current_gene.loc[gene_expression_data_current_gene['Samples'].isin(selected_samples_train)].drop(columns=['Samples'])
        methylation_data_close_to_current_gene_train = methylation_data_close_to_current_gene.loc[methylation_data_close_to_current_gene['Samples'].isin(selected_samples_train)].drop(columns=['Samples'])

        # Retrieving the gene expression and methylation data of which the samples belong to the test set.
        gene_expression_data_current_gene_test = gene_expression_data_current_gene.loc[gene_expression_data_current_gene['Samples'].isin(selected_samples_test)].drop(columns=['Samples'])
        methylation_data_close_to_current_gene_test = methylation_data_close_to_current_gene.loc[methylation_data_close_to_current_gene['Samples'].isin(selected_samples_test)].drop(columns=['Samples'])
        
        # Creating a new Ridge Regression model by calling the constructor 'Ridge()' with as parameter the constant that 
        # multiplies the L2 term and calling the function 'fit()' to train the model with as X-data 
        # 'methylation_data_close_to_current_gene_train' and as Y-data 'gene_expression_data_current_gene_train'.
        model_current_gene = Ridge(alpha=alpha, max_iter=2000) 
        model_current_gene.fit(methylation_data_close_to_current_gene_train, gene_expression_data_current_gene_train)

        # Predicting the gene expression values based on the 'methylation_data_close_to_current_gene_test' by calling the 
        # function 'predict()'.
        gene_expression_data_current_gene_predict = model_current_gene.predict(methylation_data_close_to_current_gene_test)
        
        # Calculating the R2 score by calling the function 'r2_score()' with the actual values 
        # 'gene_expression_data_current_gene_test' and the predicted values 'gene_expression_data_current_gene_predict'.
        R2_score = r2_score(gene_expression_data_current_gene_test, gene_expression_data_current_gene_predict)
        
        # Adding the 'R2_score' value to the 'R2_scores_current_gene' list by calling the function 'append()'.
        R2_scores_current_gene.append(R2_score)

    return {'gene': gene, 'R2': np.mean(R2_scores_current_gene)}


# This function retrieves the R2 scores for the ridge regression models (one for each gene) fitted to predict the   
# 'gene_expression_data_log2_transformed_chromosome1' based on the 'methylation_data_M_transformed'.
def ridge_regression_with_alpha_experiment(alpha):
    
    # Retrieving the genes present in the DataFrame 'gene_expression_data_log2_transformed_chromosome1'.
    genes = gene_expression_data_log2_transformed_chromosome1.columns[1:]
    
    # Defining a list where all the R2 scores (one for each gene) will be stored such that we can later represent these
    # within a box plot to compare them with the R2 scores for the other experiments. This can be achieved by calling the
    # function 'calculate_R2_scores()' for each of the genes. Since the computations for a single gene do not influence the 
    # computations of any other gene, we can parallelize the execution of this function by calling the function 'Parallel()' 
    # from the 'joblib' library.
    R2_scores = Parallel(n_jobs=512)(delayed(calculate_R2_scores)(gene, distance_experiment, alpha) for gene in genes)
    
    # Combining all the key-value pairs into a single dictionary.
    R2_scores_dictionary = {R2_score['gene']: R2_score['R2'] for R2_score in R2_scores}
    
    return R2_scores_dictionary
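The parallelization pattern used in 'ridge_regression_with_alpha_experiment()' can be seen in isolation in the following minimal sketch. The function `score_one_gene`, the gene names, and the returned values are placeholders that stand in for 'calculate_R2_scores()' and real R<sup>2</sup> scores. The thread-based backend and `n_jobs=2` are chosen here only to keep the example lightweight and are not the settings used in this notebook.

```python
from joblib import Parallel, delayed


def score_one_gene(gene, alpha):
    """Placeholder for 'calculate_R2_scores()'; the returned value is not a real R2 score."""
    return {"gene": gene, "R2": len(gene) * alpha}


genes_example = ["gene_a", "gene_bb", "gene_ccc"]

# 'delayed()' wraps each call so that 'Parallel()' can distribute the calls over the workers.
# The results are returned in the same order as the inputs.
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(score_one_gene)(gene, 0.5) for gene in genes_example
)

# Combining the per-gene results into a single dictionary, as done in the notebook.
scores_by_gene = {result["gene"]: result["R2"] for result in results}
print(scores_by_gene)
```

Because each gene is processed independently, no state needs to be shared between the workers. Setting `n_jobs=-1` lets 'joblib' use all available CPU cores.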

#### Running the experiment to find the best alpha

Next, we can define a list of alphas and call the function 'ridge_regression_with_alpha_experiment()' for each of them.

In [13]:
# The list of alphas for each of which the function 'ridge_regression_with_alpha_experiment()' is called. Since alpha has
# no upper bound, we first look at values of 0.10 through 10.
alphas = [0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.75, 1.00, 1.50, 2.00, 3.00, 4.00, 5.00, 6.00, 7.00, 8.00, 9.00, 10.00]

# Creating a dictionary which will later contain the lists of R2 scores for the alpha values defined in the list above.
# The keys are the alphas formatted with two decimals (e.g. '0.10'), which are later used as column names.
alphas_0_to_10 = [f"{alpha:.2f}" for alpha in alphas]
R2_0_to_10 = {alpha: [] for alpha in alphas_0_to_10}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the dictionary to be converted into a
# DataFrame.
R2_0_to_10_df = pd.DataFrame(R2_0_to_10)

# Adding the names of all the genes for which the R2 score is computed to the DataFrame storing all of the R2 scores.
R2_0_to_10_df.insert(0, 'Gene', gene_expression_data_log2_transformed_chromosome1.columns[1:])

print("The empty 'R2_0_to_10_df' DataFrame:")
R2_0_to_10_df

The empty 'R2_0_to_10_df' DataFrame:


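| | Gene | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.40 | 0.50 | 0.75 | ... | 1.50 | 2.00 | 3.00 | 4.00 | 5.00 | 6.00 | 7.00 | 8.00 | 9.00 | 10.00 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000004487 | | | | | | | | | | ... | | | | | | | | | | |
| 1 | ENSG00000007341 | | | | | | | | | | ... | | | | | | | | | | |
| 2 | ENSG00000007923 | | | | | | | | | | ... | | | | | | | | | | |
| 3 | ENSG00000007933 | | | | | | | | | | ... | | | | | | | | | | |
| 4 | ENSG00000010803 | | | | | | | | | | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 374 | ENSG00000285867 | | | | | | | | | | ... | | | | | | | | | | |
| 375 | ENSG00000286383 | | | | | | | | | | ... | | | | | | | | | | |
| 376 | ENSG00000286619 | | | | | | | | | | ... | | | | | | | | | | |
| 377 | ENSG00000287064 | | | | | | | | | | ... | | | | | | | | | | |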
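
The string labels in the cell above repeat the numeric alphas by hand, so the two lists can drift apart when one of them is edited. As a minimal sketch (the names 'alpha_labels' and 'label_to_alpha' are illustrative and not part of the original notebook), the labels could instead be derived from the numeric list:

```python
# The numeric alphas as defined in the cell above.
alphas = [0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.75, 1.00, 1.50, 2.00,
          3.00, 4.00, 5.00, 6.00, 7.00, 8.00, 9.00, 10.00]

# Deriving the column labels by formatting every alpha with two decimals.
# 'alpha_labels' is an illustrative name, not a variable of the notebook.
alpha_labels = [f"{alpha:.2f}" for alpha in alphas]

# A lookup from label back to the numeric alpha (also illustrative).
label_to_alpha = dict(zip(alpha_labels, alphas))

print(alpha_labels)
```

Because both lists come from the same source, adding or removing an alpha only has to be done in one place.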


Now, we can call the function 'ridge_regression_with_alpha_experiment()' for each of the alphas.

In [None]:
# Looping over every 'alpha' within the 'alphas' list and applying ridge regression to the M-transformed methylation data 
# and log2-transformed gene expression data for the 'alpha' and adding their resulting lists filled with R2 scores to the 
# 'R2_0_to_10_df' DataFrame.
for index, alpha in enumerate(alphas):
    
    start = time.time()
    
    # Retrieving the R2 scores for the ridge regression models (one for each gene) fitted to predict the gene expression 
    # values based on the methylation data by calling the function 'ridge_regression_with_alpha_experiment()' with as 
    # argument the 'alpha'.
    R2_scores_current_alpha = ridge_regression_with_alpha_experiment(alpha)
    
    # Adding the 'R2_scores_current_alpha' to the corresponding column of the general DataFrame.
    R2_0_to_10_df[alphas_0_to_10[index]] = R2_scores_current_alpha.values()
    
    end = time.time()
    print(f"{end-start} seconds")
    
print("The 'R2_0_to_10_df' DataFrame:")
R2_0_to_10_df
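
The loop above assigns 'R2_scores_current_alpha.values()' by position, which relies on the dictionary listing the genes in the same order as the 'Gene' column. As a sketch of a more defensive alternative (with made-up gene names and scores, and an illustrative DataFrame called 'results_df'), the scores can be looked up by gene name with 'Series.map()':

```python
import pandas as pd

# Hypothetical stand-in for the dictionary returned by
# 'ridge_regression_with_alpha_experiment()': gene name -> R2 score.
R2_scores_current_alpha = {"gene_B": 0.20, "gene_A": 0.10, "gene_C": 0.30}

# A results DataFrame whose 'Gene' column is ordered differently from the dictionary.
results_df = pd.DataFrame({"Gene": ["gene_A", "gene_B", "gene_C"]})

# Looking up every gene by name, so each score ends up in the row of its own gene.
# Genes that are missing from the dictionary would receive NaN.
results_df["1.00"] = results_df["Gene"].map(R2_scores_current_alpha)

print(results_df)
```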

In [None]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_distance + "/R2_alpha_0_to_10_df.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_0_to_10_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

Using the 'R2_0_to_10_df' DataFrame to which all R<sup>2</sup> scores have been added for each of the alphas, we can now create the box plots (one for each alpha) by calling the function 'boxplot()' from the 'Seaborn' library. We can also save this plot to the directory 'data_directory_results_distance' by calling the function 'savefig()'.

In [None]:
plt.figure(figsize=(20, 12))

# Creating a box plot for every alpha column in the 'R2_0_to_10_df' DataFrame, plotting them on the same axis, without 
# showing the outliers.
ax = sns.boxplot(data=R2_0_to_10_df, showfliers=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor', fontsize=24)
ax.tick_params(axis='y', labelsize=24)

# Adding a title and the axis labels to the plot.
ax.set_title('The Distributions of the R2 scores of the Alphas', pad=20, fontsize=30)
ax.set_xlabel('Alpha', labelpad=20, fontsize=30)
ax.set_ylabel('R2 score', labelpad=20, fontsize=30)

# Saving the plot by calling the function 'savefig()'.
file_to_save = data_directory_results_distance + "/R2_scores_alphas_0_to_10.png"
plt.savefig(file_to_save, bbox_inches='tight')

# Show the plot
plt.show()
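
The medians that the box plots display can also be computed directly, which makes small differences between neighbouring alphas easier to compare. The sketch below uses a hypothetical toy DataFrame ('toy_R2_df') with the same layout as 'R2_0_to_10_df'; its values are made up for illustration and are not results from this notebook.

```python
import pandas as pd

# Hypothetical toy R2 scores (three genes, three alphas), not results from this notebook.
toy_R2_df = pd.DataFrame({
    "Gene": ["gene_A", "gene_B", "gene_C"],
    "0.10": [0.10, 0.20, 0.30],
    "1.00": [0.12, 0.22, 0.35],
    "10.00": [0.15, 0.21, 0.40],
})

# Median R2 score per alpha column (the 'Gene' column is dropped first).
median_per_alpha = toy_R2_df.drop(columns="Gene").median()

# The label of the alpha column with the highest median.
best_alpha_label = median_per_alpha.idxmax()

print(median_per_alpha)
print("Alpha with the highest median:", best_alpha_label)
```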

As we can see from the box plots above, the median R<sup>2</sup> score seems to increase slightly each time alpha increases. To see whether this trend continues, we can also calculate the distributions for alphas ranging from 10 to 10,000 by calling the function 'ridge_regression_with_alpha_experiment()' for each of these alphas.

In [3]:
# The list of alphas ranging from 10 through 10,000; the function 'ridge_regression_with_alpha_experiment()' is called 
# once for each of them.
alphas = [10, 25, 50, 100, 150, 250, 375, 500, 650, 800, 1000, 1250, 1500, 2000, 3000, 5000, 7500, 10000]

# Creating a dictionary which will later contain the lists of R2 scores for the alpha values defined in the list above.
alphas_10_to_10000 = ['10', '25', '50', '100', '150', '250', '375', '500', '650', '800', '1000', '1250', '1500', '2000', '3000', '5000', '7500', '10000']
R2_10_to_10000 = {alpha: [] for alpha in alphas_10_to_10000}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the dictionary to be converted into a
# DataFrame.
R2_10_to_10000_df = pd.DataFrame(R2_10_to_10000)

# Adding the names of all the genes for which the R2 score is computed to the DataFrame storing all of the R2 scores.
R2_10_to_10000_df.insert(0, 'Gene', gene_expression_data_log2_transformed_chromosome1.columns[1:])

print("The empty 'R2_10_to_10000_df' DataFrame:")
R2_10_to_10000_df


Now, we can call the function 'ridge_regression_with_alpha_experiment()' for each of the alphas.

In [None]:
# Looping over every 'alpha' within the 'alphas' list and applying ridge regression to the M-transformed methylation data 
# and log2-transformed gene expression data for the 'alpha' and adding their resulting lists filled with R2 scores to the 
# 'R2_10_to_10000_df' DataFrame.
for index, alpha in enumerate(alphas):
    
    start = time.time()
    
    # Retrieving the R2 scores for the ridge regression models (one for each gene) fitted to predict the gene expression 
    # values based on the methylation data by calling the function 'ridge_regression_with_alpha_experiment()' with as 
    # argument the 'alpha'.
    R2_scores_current_alpha = ridge_regression_with_alpha_experiment(alpha)
    
    # Adding the 'R2_scores_current_alpha' to the corresponding column of the general DataFrame.
    R2_10_to_10000_df[alphas_10_to_10000[index]] = R2_scores_current_alpha.values()
    
    end = time.time()
    print(f"{end-start} seconds")
    
print("The 'R2_10_to_10000_df' DataFrame:")
R2_10_to_10000_df

In [None]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_distance + "/R2_alpha_10_to_10000_df.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_10_to_10000_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

Using the 'R2_10_to_10000_df' DataFrame to which all R<sup>2</sup> scores have been added for each of the alphas, we can now create the box plots (one for each alpha) by calling the function 'boxplot()' from the 'Seaborn' library. We can also save this plot to the directory 'data_directory_results_distance' by calling the function 'savefig()'.

In [None]:
plt.figure(figsize=(20, 12))

# Creating a box plot for every alpha column in the 'R2_10_to_10000_df' DataFrame, plotting them on the same axis, without 
# showing the outliers.
ax = sns.boxplot(data=R2_10_to_10000_df, showfliers=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor', fontsize=24)
ax.tick_params(axis='y', labelsize=24)

# Adding a title and the axis labels to the plot.
ax.set_title('The Distributions of the R2 scores of the Alphas', pad=20, fontsize=30)
ax.set_xlabel('Alpha', labelpad=20, fontsize=30)
ax.set_ylabel('R2 score', labelpad=20, fontsize=30)

# Saving the plot by calling the function 'savefig()'.
file_to_save = data_directory_results_distance + "/R2_scores_alphas_10_to_10000.png"
plt.savefig(file_to_save, bbox_inches='tight')

# Show the plot
plt.show()

As we can see from the box plots above, there seems to be an optimal value of alpha: an alpha of around 375 has the highest median R<sup>2</sup> score. A problem with choosing such a high value for alpha is that the model may become too simple to capture the complexity of the underlying data (underfitting), resulting in high bias. A large regularization parameter alpha increases the penalty applied to the magnitude of the coefficients, which in turn shrinks the estimated coefficient values significantly. With a large alpha, the ridge regression algorithm therefore focuses more on reducing the magnitude of the coefficients than on fitting the data, which makes the model less sensitive to the patterns and relationships that occur in the data.

Therefore, we choose a smaller alpha of 0.90, which helps us to avoid the problems mentioned above. This alpha is used throughout the remainder of this notebook for experimenting with different distances in the section 'Applying Ridge Regression'.
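
The shrinkage effect described above can be illustrated on synthetic data. The following is a minimal sketch, assuming randomly generated features and targets (not the methylation or gene expression data); it fits 'Ridge' from 'sklearn.linear_model' for a few alphas and prints the Euclidean norm of the fitted coefficients, which becomes smaller as alpha grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration, not the data of this notebook).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
true_coefficients = rng.normal(size=10)
y = X @ true_coefficients + rng.normal(scale=0.5, size=50)

# Fitting one ridge regression model per alpha and storing the norm of its coefficients.
coefficient_norms = {}
for alpha in [0.9, 10, 375, 10000]:
    model = Ridge(alpha=alpha).fit(X, y)
    coefficient_norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>7}: norm of the coefficients = {coefficient_norms[alpha]:.4f}")
```

For very large alphas the coefficients approach zero, so the predictions approach the mean of the target, which is the underfitting behaviour discussed above.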

### Applying Ridge Regression

Now we can apply ridge regression with an alpha of 0.90 to the M-transformed methylation data and log2-transformed gene expression data. The distances are divided into 8 blocks of 5 distances because running them all within the same code block is quite a memory-expensive computation (and we do not want to lose any progress). Separating the distances into multiple blocks allows us to save a separate DataFrame featuring the R<sup>2</sup> scores for each of the 8 blocks by calling the function 'to_csv()' on each DataFrame. These DataFrames can later be easily recombined into a single DataFrame for displaying purposes.

Before we do this we should first define the function 'ridge_regression_with_distance()' below.

In [None]:
# This function retrieves the R2 scores for the ridge regression models (one for each gene) fitted to predict the   
# 'gene_expression_data_log2_transformed' based on the 'methylation_data_M_transformed'.
def ridge_regression_with_distance(distance):
    
    # Retrieving the genes present in the DataFrame 'gene_expression_data_log2_transformed'.
    genes = gene_expression_data_log2_transformed.columns[1:]
    
    # Defining a list where all the R2 scores (one for each gene) will be stored such that we can later represent these
    # within a box plot to compare them with the R2 scores for the other experiments. This can be achieved by calling the
    # function 'calculate_R2_scores()' for each of the genes. Since the computations for a single gene do not influence the 
    # computations of any other gene, we can parallelize the execution of this function by calling the function 'Parallel()' 
    # from the 'joblib' library.
    R2_scores = Parallel(n_jobs=512)(delayed(calculate_R2_scores)(gene, distance, 0.90) for gene in genes)
    
    # Combining all the key-value pairs into a single dictionary.
    R2_scores_dictionary = {R2_score['gene']: R2_score['R2'] for R2_score in R2_scores}
    
    return R2_scores_dictionary

In addition, we should also define the list featuring all the distances that will be experimented with.

In [None]:
# Defining the list featuring all the distances that will be experimented with.
distances = [5000, 10000, 15000, 25000, 50000, 75000, 100000, 150000, 250000, 350000, 500000, 750000, 
             1000000, 1500000, 2000000, 2500000, 4000000, 5000000, 6000000, 7500000, 10000000, 12500000, 
             15000000, 17500000, 20000000, 25000000, 30000000, 40000000, 50000000, 65000000, 80000000, 
             100000000, 120000000, 150000000, 200000000, 250000000, 350000000, 500000000, 750000000, 1000000000]
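
The eight blocks used in the subsections below correspond to consecutive slices of five distances. As a sketch (the names 'block_size', 'distance_blocks' and 'block_labels' are illustrative and do not appear elsewhere in this notebook), the slices and their comma-formatted column labels could be generated as follows:

```python
# The distances as defined in the cell above.
distances = [5000, 10000, 15000, 25000, 50000, 75000, 100000, 150000, 250000, 350000, 500000, 750000,
             1000000, 1500000, 2000000, 2500000, 4000000, 5000000, 6000000, 7500000, 10000000, 12500000,
             15000000, 17500000, 20000000, 25000000, 30000000, 40000000, 50000000, 65000000, 80000000,
             100000000, 120000000, 150000000, 200000000, 250000000, 350000000, 500000000, 750000000, 1000000000]

# Splitting the list into consecutive blocks of five distances.
block_size = 5
distance_blocks = [distances[start:start + block_size] for start in range(0, len(distances), block_size)]

# Formatting every distance with thousands separators, matching the column labels used below.
block_labels = [[f"{distance:,}" for distance in block] for block in distance_blocks]

for labels in block_labels:
    print(labels[0], "through", labels[-1])
```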

#### Ridge Regression for Distances 5,000 through 50,000

In [23]:
# Creating a dictionary which will later contain the lists of R2 scores for distance 5,000 through 50,000.
numbers_5000_to_50000 = ['5,000', '10,000', '15,000', '25,000', '50,000']
R2_5000_to_50000 = {number: [] for number in numbers_5000_to_50000}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the dictionary to be converted into a
# DataFrame.
R2_5000_to_50000_df = pd.DataFrame(R2_5000_to_50000)

# Adding the names of all the genes for which the R2 score is computed to the DataFrame storing all of the R2 scores.
R2_5000_to_50000_df.insert(0, 'Gene', gene_expression_data_log2_transformed.columns[1:])

print("The empty 'R2_5000_to_50000_df' DataFrame:")
R2_5000_to_50000_df

The empty 'R2_5000_to_50000_df' DataFrame:


| | Gene | 5,000 | 10,000 | 15,000 | 25,000 | 50,000 |
|---|---|---|---|---|---|---|
| 0 | ENSG00000001561 | | | | | |
| 1 | ENSG00000001629 | | | | | |
| 2 | ENSG00000001631 | | | | | |
| 3 | ENSG00000002587 | | | | | |
| 4 | ENSG00000002746 | | | | | |
| ... | ... | ... | ... | ... | ... | ... |
| 3438 | ENSG00000287893 | | | | | |
| 3439 | ENSG00000288156 | | | | | |
| 3440 | ENSG00000288586 | | | | | |
| 3441 | ENSG00000288612 | | | | | |


In [None]:
# Looping over every 'distance' within the 'distances' list that is present in the current block and applying ridge 
# regression to the M-transformed methylation data and log2-transformed gene expression data for the 'distance' and adding 
# their resulting lists filled with R2 scores to the 'R2_5000_to_50000_df' DataFrame.
for index, distance in enumerate(distances[0:5]):
    
    start = time.time()
    
    # Retrieving the R2 scores for the ridge regression models (one for each gene) fitted to predict the gene expression 
    # values based on the methylation data by calling the function 'ridge_regression_with_distance()' with as argument the 
    # 'distance' which will be used to filter the CpG sites used to predict the expression value for a gene.
    R2_scores_current_distance = ridge_regression_with_distance(distance)
    
    # Adding the 'R2_scores_current_distance' to the corresponding column of the general DataFrame.
    R2_5000_to_50000_df[numbers_5000_to_50000[index]] = R2_scores_current_distance.values()
    
    end = time.time()
    print(f"{end-start} seconds")
    
print("The 'R2_5000_to_50000_df' DataFrame:")
R2_5000_to_50000_df

In [None]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_distance + "/R2_5000_to_50000_df.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_5000_to_50000_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")
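
The cells above (creating the DataFrame, filling it and saving it) are repeated for each of the remaining blocks. As a sketch of how this repetition could be factored out, the hypothetical helper 'run_distance_block()' below builds the DataFrame for one block; 'toy_score_function()' is a made-up stand-in for 'ridge_regression_with_distance()' so that the example runs on its own.

```python
import pandas as pd

# Hypothetical helper (not part of the original notebook): 'score_function' maps a
# distance to a dictionary of the form {gene: R2 score}.
def run_distance_block(genes, block_distances, score_function):

    # Starting with a DataFrame that only contains the gene names.
    block_df = pd.DataFrame({"Gene": list(genes)})

    # Adding one column of R2 scores per distance, labelled with thousands separators.
    for distance in block_distances:
        scores = score_function(distance)
        block_df[f"{distance:,}"] = block_df["Gene"].map(scores)

    return block_df

# Made-up stand-in for 'ridge_regression_with_distance()', returning arbitrary values.
def toy_score_function(distance):
    return {"gene_A": distance / 1e6, "gene_B": distance / 2e6}

toy_block_df = run_distance_block(["gene_A", "gene_B"], [5000, 10000], toy_score_function)
print(toy_block_df)
```

With such a helper, each block would reduce to a single call followed by the existing 'to_csv()' step.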

#### Ridge Regression for Distances 75,000 through 350,000

In [None]:
# Creating a dictionary which will later contain the lists of R2 scores for distance 75,000 through 350,000.
numbers_75000_to_350000 = ['75,000', '100,000', '150,000', '250,000', '350,000']
R2_75000_to_350000 = {number: [] for number in numbers_75000_to_350000}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the dictionary to be converted into a
# DataFrame.
R2_75000_to_350000_df = pd.DataFrame(R2_75000_to_350000)

# Adding the names of all the genes for which the R2 score is computed to the DataFrame storing all of the R2 scores.
R2_75000_to_350000_df.insert(0, 'Gene', gene_expression_data_log2_transformed.columns[1:])

print("The empty 'R2_75000_to_350000_df' DataFrame:")
R2_75000_to_350000_df

In [None]:
# Looping over every 'distance' within the 'distances' list that is present in the current block and applying ridge 
# regression to the M-transformed methylation data and log2-transformed gene expression data for the 'distance' and adding 
# their resulting lists filled with R2 scores to the 'R2_75000_to_350000_df' DataFrame.
for index, distance in enumerate(distances[5:10]):
    
    start = time.time()
    
    # Retrieving the R2 scores for the ridge regression models (one for each gene) fitted to predict the gene expression 
    # values based on the methylation data by calling the function 'ridge_regression_with_distance()' with as argument the 
    # 'distance' which will be used to filter the CpG sites used to predict the expression value for a gene.
    R2_scores_current_distance = ridge_regression_with_distance(distance)
    
    # Adding the 'R2_scores_current_distance' to the corresponding column of the general DataFrame.
    R2_75000_to_350000_df[numbers_75000_to_350000[index]] = R2_scores_current_distance.values()
    
    end = time.time()
    print(f"{end-start} seconds")
    
print("The 'R2_75000_to_350000_df' DataFrame:")
R2_75000_to_350000_df

In [None]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_distance + "/R2_75000_to_350000_df.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_75000_to_350000_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

#### Ridge Regression for Distances 500,000 through 2,000,000

In [None]:
# Creating a dictionary which will later contain the lists of R2 scores for distance 500,000 through 2,000,000.
numbers_500000_to_2000000 = ['500,000', '750,000', '1,000,000', '1,500,000', '2,000,000']
R2_500000_to_2000000 = {number: [] for number in numbers_500000_to_2000000}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the dictionary to be converted into a
# DataFrame.
R2_500000_to_2000000_df = pd.DataFrame(R2_500000_to_2000000)

# Adding the names of all the genes for which the R2 score is computed to the DataFrame storing all of the R2 scores.
R2_500000_to_2000000_df.insert(0, 'Gene', gene_expression_data_log2_transformed.columns[1:])

print("The empty 'R2_500000_to_2000000_df' DataFrame:")
R2_500000_to_2000000_df

In [None]:
# Looping over every 'distance' within the 'distances' list that is present in the current block and applying ridge 
# regression to the M-transformed methylation data and log2-transformed gene expression data for the 'distance' and adding 
# their resulting lists filled with R2 scores to the 'R2_500000_to_2000000_df' DataFrame.
for index, distance in enumerate(distances[10:15]):
    
    start = time.time()
    
    # Retrieving the R2 scores for the ridge regression models (one for each gene) fitted to predict the gene expression 
    # values based on the methylation data by calling the function 'ridge_regression_with_distance()' with as argument the 
    # 'distance' which will be used to filter the CpG sites used to predict the expression value for a gene.
    R2_scores_current_distance = ridge_regression_with_distance(distance)
    
    # Adding the 'R2_scores_current_distance' to the corresponding column of the general DataFrame.
    R2_500000_to_2000000_df[numbers_500000_to_2000000[index]] = R2_scores_current_distance.values()
    
    end = time.time()
    print(f"{end-start} seconds")
    
print("The 'R2_500000_to_2000000_df' DataFrame:")
R2_500000_to_2000000_df

In [None]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_distance + "/R2_500000_to_2000000_df.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_500000_to_2000000_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

#### Ridge Regression for Distances 2,500,000 through 7,500,000

In [None]:
# Creating a dictionary which will later contain the lists of R2 scores for distance 2,500,000 through 7,500,000.
numbers_2500000_to_7500000 = ['2,500,000', '4,000,000', '5,000,000', '6,000,000', '7,500,000']
R2_2500000_to_7500000 = {number: [] for number in numbers_2500000_to_7500000}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the dictionary to be converted into a
# DataFrame.
R2_2500000_to_7500000_df = pd.DataFrame(R2_2500000_to_7500000)

# Adding the names of all the genes for which the R2 score is computed to the DataFrame storing all of the R2 scores.
R2_2500000_to_7500000_df.insert(0, 'Gene', gene_expression_data_log2_transformed.columns[1:])

print("The empty 'R2_2500000_to_7500000_df' DataFrame:")
R2_2500000_to_7500000_df

In [None]:
# Looping over every 'distance' within the 'distances' list that is present in the current block and applying ridge 
# regression to the M-transformed methylation data and log2-transformed gene expression data for the 'distance' and adding 
# their resulting lists filled with R2 scores to the 'R2_2500000_to_7500000_df' DataFrame.
for index, distance in enumerate(distances[15:20]):
    
    start = time.time()
    
    # Retrieving the R2 scores for the ridge regression models (one for each gene) fitted to predict the gene expression 
    # values based on the methylation data by calling the function 'ridge_regression_with_distance()' with as argument the 
    # 'distance' which will be used to filter the CpG sites used to predict the expression value for a gene.
    R2_scores_current_distance = ridge_regression_with_distance(distance)
    
    # Adding the 'R2_scores_current_distance' to the corresponding column of the general DataFrame.
    R2_2500000_to_7500000_df[numbers_2500000_to_7500000[index]] = R2_scores_current_distance.values()
    
    end = time.time()
    print(f"{end-start} seconds")
    
print("The 'R2_2500000_to_7500000_df' DataFrame:")
R2_2500000_to_7500000_df

In [None]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_distance + "/R2_2500000_to_7500000_df.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_2500000_to_7500000_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

#### Ridge Regression for Distances 10,000,000 through 20,000,000

In [None]:
# Creating a dictionary which will later contain the lists of R2 scores for distance 10,000,000 through 20,000,000.
numbers_10000000_to_20000000 = ['10,000,000', '12,500,000', '15,000,000', '17,500,000', '20,000,000']
R2_10000000_to_20000000 = {number: [] for number in numbers_10000000_to_20000000}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the dictionary to be converted into a
# DataFrame.
R2_10000000_to_20000000_df = pd.DataFrame(R2_10000000_to_20000000)

# Adding the names of all the genes for which the R2 score is computed to the DataFrame storing all of the R2 scores.
R2_10000000_to_20000000_df.insert(0, 'Gene', gene_expression_data_log2_transformed.columns[1:])

print("The empty 'R2_10000000_to_20000000_df' DataFrame:")
R2_10000000_to_20000000_df

In [None]:
# Looping over every 'distance' within the 'distances' list that is present in the current block and applying ridge 
# regression to the M-transformed methylation data and log2-transformed gene expression data for the 'distance' and adding 
# their resulting lists filled with R2 scores to the 'R2_10000000_to_20000000_df' DataFrame.
for index, distance in enumerate(distances[20:25]):
    
    start = time.time()
    
    # Retrieving the R2 scores for the ridge regression models (one for each gene) fitted to predict the gene expression 
    # values based on the methylation data by calling the function 'ridge_regression_with_distance()' with as argument the 
    # 'distance' which will be used to filter the CpG sites used to predict the expression value for a gene.
    R2_scores_current_distance = ridge_regression_with_distance(distance)
    
    # Adding the 'R2_scores_current_distance' to the corresponding column of the general DataFrame.
    R2_10000000_to_20000000_df[numbers_10000000_to_20000000[index]] = R2_scores_current_distance.values()
    
    end = time.time()
    print(f"{end-start} seconds")
    
print("The 'R2_10000000_to_20000000_df' DataFrame:")
R2_10000000_to_20000000_df

In [None]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_distance + "/R2_10000000_to_20000000_df.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_10000000_to_20000000_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

#### Ridge Regression for Distances 25,000,000 through 65,000,000

In [None]:
# Creating a dictionary which will later contain the lists of R2 scores for distance 25,000,000 through 65,000,000.
numbers_25000000_to_65000000 = ['25,000,000', '30,000,000', '40,000,000', '50,000,000', '65,000,000']
R2_25000000_to_65000000 = {number: [] for number in numbers_25000000_to_65000000}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the dictionary to be converted into a
# DataFrame.
R2_25000000_to_65000000_df = pd.DataFrame(R2_25000000_to_65000000)

# Adding the names of all the genes for which the R2 score is computed to the DataFrame storing all of the R2 scores.
R2_25000000_to_65000000_df.insert(0, 'Gene', gene_expression_data_log2_transformed.columns[1:])

print("The empty 'R2_25000000_to_65000000_df' DataFrame:")
R2_25000000_to_65000000_df

In [None]:
# Looping over every 'distance' within the 'distances' list that is present in the current block and applying ridge 
# regression to the M-transformed methylation data and log2-transformed gene expression data for the 'distance' and adding 
# their resulting lists filled with R2 scores to the 'R2_25000000_to_65000000_df' DataFrame.
for index, distance in enumerate(distances[25:30]):
    
    start = time.time()
    
    # Retrieving the R2 scores for the ridge regression models (one for each gene) fitted to predict the gene expression 
    # values based on the methylation data by calling the function 'ridge_regression_with_distance()' with as argument the 
    # 'distance' which will be used to filter the CpG sites used to predict the expression value for a gene.
    R2_scores_current_distance = ridge_regression_with_distance(distance)
    
    # Adding the 'R2_scores_current_distance' to the corresponding column of the general DataFrame.
    R2_25000000_to_65000000_df[numbers_25000000_to_65000000[index]] = R2_scores_current_distance.values()
    
    end = time.time()
    print(f"{end-start} seconds")
    
print("The 'R2_25000000_to_65000000_df' DataFrame:")
R2_25000000_to_65000000_df

In [None]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_distance + "/R2_25000000_to_65000000_df.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_25000000_to_65000000_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

#### Ridge Regression for Distances 80,000,000 through 200,000,000

In [None]:
# Creating a dictionary which will later contain the lists of R2 scores for distance 80,000,000 through 200,000,000.
numbers_80000000_to_200000000 = ['80,000,000', '100,000,000', '120,000,000', '150,000,000', '200,000,000']
R2_80000000_to_200000000 = {number: [] for number in numbers_80000000_to_200000000}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the dictionary to be converted into a
# DataFrame.
R2_80000000_to_200000000_df = pd.DataFrame(R2_80000000_to_200000000)

# Adding the names of all the genes for which the R2 score is computed to the DataFrame storing all of the R2 scores.
R2_80000000_to_200000000_df.insert(0, 'Gene', gene_expression_data_log2_transformed.columns[1:])

print("The empty 'R2_80000000_to_200000000_df' DataFrame:")
R2_80000000_to_200000000_df

In [1]:
# Looping over every 'distance' within the 'distances' list that is present in the current block and applying ridge 
# regression to the M-transformed methylation data and log2-transformed gene expression data for the 'distance' and adding 
# their resulting lists filled with R2 scores to the 'R2_80000000_to_200000000_df' DataFrame.
for index, distance in enumerate(distances[30:35]):
    
    start = time.time()
    
    # Retrieving the R2 scores for the ridge regression models (one for each gene) fitted to predict the gene expression 
    # values based on the methylation data by calling the function 'ridge_regression_with_distance()' with as argument the 
    # 'distance' which will be used to filter the CpG sites used to predict the expression value for a gene.
    R2_scores_current_distance = ridge_regression_with_distance(distance)
    
    # Adding the 'R2_scores_current_distance' to the corresponding column of the general DataFrame.
    R2_80000000_to_200000000_df[numbers_80000000_to_200000000[index]] = R2_scores_current_distance.values()
    
    end = time.time()
    print(f"{end-start} seconds")
    
print("The 'R2_80000000_to_200000000_df' DataFrame:")
R2_80000000_to_200000000_df


In [None]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_distance + "/R2_80000000_to_200000000_df.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_80000000_to_200000000_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

#### Ridge Regression for Distances 250,000,000 through 1,000,000,000

In [None]:
# Creating a dictionary which will later contain the lists of R2 scores for distance 250,000,000 through 1,000,000,000.
numbers_250000000_to_1000000000 = ['250,000,000', '350,000,000', '500,000,000', '750,000,000', '1,000,000,000']
R2_250000000_to_1000000000 = {number: [] for number in numbers_250000000_to_1000000000}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input the dictionary to be converted into a
# DataFrame.
R2_250000000_to_1000000000_df = pd.DataFrame(R2_250000000_to_1000000000)

# Adding the names of all the genes for which the R2 score is computed to the DataFrame storing all of the R2 scores.
R2_250000000_to_1000000000_df.insert(0, 'Gene', gene_expression_data_log2_transformed.columns[1:])

print("The empty 'R2_250000000_to_1000000000_df' DataFrame:")
R2_250000000_to_1000000000_df

In [None]:
# Looping over every 'distance' within the 'distances' list that is present in the current block and applying ridge 
# regression to the M-transformed methylation data and log2-transformed gene expression data for the 'distance' and adding 
# their resulting lists filled with R2 scores to the 'R2_250000000_to_1000000000_df' DataFrame.
for index, distance in enumerate(distances[35:]):
    
    start = time.time()
    
    # Retrieving the R2 scores for the ridge regression models (one for each gene) fitted to predict the gene expression 
    # values based on the methylation data by calling the function 'ridge_regression_with_distance()' with as argument the 
    # 'distance' which will be used to filter the CpG sites used to predict the expression value for a gene.
    R2_scores_current_distance = ridge_regression_with_distance(distance)
    
    # Adding the 'R2_scores_current_distance' to the corresponding column of the general DataFrame.
    R2_250000000_to_1000000000_df[numbers_250000000_to_1000000000[index]] = R2_scores_current_distance.values()
    
    end = time.time()
    print(f"{end-start} seconds")
    
print("The 'R2_250000000_to_1000000000_df' DataFrame:")
R2_250000000_to_1000000000_df

In [None]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_results_distance + "/R2_250000000_to_1000000000_df.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    R2_250000000_to_1000000000_df.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")
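
The eight CSV files saved above can be recombined into a single DataFrame by joining them on the 'Gene' column. The following is a minimal sketch, assuming each file has one 'Gene' column plus its distance columns; 'combine_block_files()' is a hypothetical helper, and the demonstration writes two small toy files to a temporary directory instead of reading the real results.

```python
import os
import tempfile

import pandas as pd

# Hypothetical helper (not part of the original notebook): reads the per-block CSV
# files and joins them on the 'Gene' column.
def combine_block_files(file_paths):
    combined_df = pd.read_csv(file_paths[0])
    for file_path in file_paths[1:]:
        combined_df = combined_df.merge(pd.read_csv(file_path), on="Gene", how="inner")
    return combined_df

# Demonstration with two toy block files containing made-up scores.
with tempfile.TemporaryDirectory() as toy_directory:
    first_path = os.path.join(toy_directory, "R2_block_1.csv")
    second_path = os.path.join(toy_directory, "R2_block_2.csv")
    pd.DataFrame({"Gene": ["gene_A", "gene_B"], "5,000": [0.1, 0.2]}).to_csv(first_path, index=False)
    pd.DataFrame({"Gene": ["gene_A", "gene_B"], "75,000": [0.3, 0.4]}).to_csv(second_path, index=False)
    combined_toy_df = combine_block_files([first_path, second_path])

print(combined_toy_df)
```

For the real results, the list passed to the helper would contain the eight file paths within 'data_directory_results_distance', in the order in which the distance columns should appear.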