# Machine Learning Preliminary Analysis for CpG Analysis - Pearson's Correlation Coefficient
### Laurence Nickel (i6257119)

Libraries used: 
* pandas (version: '1.2.4')
* numpy (version: '1.20.1')
* re (version: '2.2.1')
* sys (version: '3.8.8')
* os (version: '3.8.8')
* plotly.express (version: '5.13.1')
* plotly.graph_objects (version: '5.13.1')
* plotly.subplots (version: '5.13.1')
* joblib (version: '1.0.1')

References:
* [1] Benesty, J., Chen, J., Huang, Y., & Cohen, I. (2009). "Pearson Correlation Coefficient," in *Noise Reduction in Speech Processing* (Berlin/Heidelberg, Germany: Springer), 37-40.
* [2] Spainhour, J. C. G., Lim, H. E., Yi, S. V., & Qiu, P. (2019). Correlation Patterns Between DNA Methylation and Gene Expression in The Cancer Genome Atlas. *Cancer Informatics, 18*: 117693511982877. doi: https://doi.org/10.1177/1176935119828776.
* [3] Siegfried, Z., & Simon, I. (2010). DNA Methylation and Gene Expression. *WIREs Mechanisms of Disease 2*(3), 362-371. doi: https://doi.org/10.1002/wsbm.64.
* [4] Mukaka, M. (2012). A Guide to Appropriate Use of Correlation Coefficient in Medical Research. *Malawi Medical Journal 24*(3), 69-71. doi: https://pubmed.ncbi.nlm.nih.gov/23638278.

## Introduction

This notebook performs some preliminary analysis for the machine learning algorithms. The analysis calculates Pearson's correlation coefficient between every CpG site and every gene and determines a threshold for the correlation coefficient: CpG sites whose correlation coefficient with the current gene falls below this threshold are not considered for the prediction of that gene. The threshold should identify the CpG sites that are relevant for predicting the value of a gene without removing so many CpG sites that a gene has none left to base its prediction on.

Pearson's correlation coefficient is a statistical measure that quantifies the strength and direction of the linear relationship between two variables, with a value between -1 and 1 [1]. A correlation coefficient of 1 indicates a perfect positive linear relationship between two variables (if one increases, the other one also increases), and a correlation coefficient of -1 indicates a perfect negative linear relationship (if one increases, the other one decreases). A correlation coefficient of 0 indicates that there is no linear relationship between the two variables, meaning that a change in one variable (either an increase or a decrease) has no predictable linear effect on whether the other variable increases or decreases. Please mind that, since the correlation coefficient falls between -1 and 1, the threshold is applied to its absolute value: a correlation coefficient of -0.9, for example, still means that the CpG site and the gene are strongly (but negatively) correlated.
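As a small illustration of these properties, the sketch below computes Pearson's correlation coefficient on toy data. The arrays and the helper `pearson_r` are hypothetical and only serve as an example; they are not part of the thesis datasets or of the code used later in this notebook.

```python
import numpy as np

# Toy data (hypothetical values, not taken from the thesis datasets).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_positive = 2.0 * x + 1.0      # increases whenever x increases
y_negative = -3.0 * x + 10.0    # decreases whenever x increases


def pearson_r(a, b):
    """Pearson's r from its definition: covariance divided by the product of the standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)


r_positive = pearson_r(x, y_positive)
r_negative = pearson_r(x, y_negative)

# 'np.corrcoef()' returns the full 2x2 correlation matrix; the element [0, 1] is r.
r_numpy = float(np.corrcoef(x, y_negative)[0, 1])

# The absolute value is what would be compared against a threshold.
print(r_positive, r_negative, abs(r_negative))
```

Both toy relationships are perfectly linear, so the coefficients come out as (approximately) 1 and -1, and taking the absolute value treats them as equally strong.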

A problem with uncorrelated CpG sites and genes is that basing the predicted expression value of a gene on CpG sites that are not correlated to it could result in worse performance than leaving those CpG sites out, or even than predicting a random value or the mean expression value. Therefore, we need to find a correlation coefficient threshold above which we can be reasonably confident that the CpG sites used to predict the expression value of a gene are actually related to that gene. Pearson's correlation coefficient has been applied successfully to gene expression and methylation data before, which indicates that it is suitable for this kind of data [2].

In the section 'Selecting Set of Predictable Genes', we select a set of genes in which every gene has at least a few (2) CpG sites correlated to it. Since only a single distance is used for the CpG Site Analysis part, we choose here the distance that will also be used for the remainder of that part. This ensures that the property of every gene having at least 2 correlated CpG sites within the distance still holds after the filtering in this notebook. From the notebooks 'Analyzing Linear Regression Results for Distance Analysis.ipynb', 'Analyzing Lasso Regression Results for Distance Analysis.ipynb', 'Analyzing Ridge Regression Results for Distance Analysis.ipynb', and 'Analyzing Elastic Net Regression Results for Distance Analysis.ipynb' we can conclude that a distance of 250,000,000 in general performs the best (of course also realizing that there may be a separate best-performing distance for each of the four machine learning algorithms applied for the Distance Analysis part). This distance will be used for the remainder of the CpG Site Analysis part and thus also for the experiments performed within this notebook. The selection procedure is based on the gene expression data file 'gene_expression_data_log2_transformed_final.csv' and the methylation data file 'methylation_data_M_transformed_final.csv', as the notebook 'Linear Regression for Testing the Datasets.ipynb' determined these to be the best-performing dataset combination. They are therefore used for the remainder of the CpG Site Analysis part, including this preliminary analysis notebook.

### Importing libraries

Before we can start to define all the functions, we should first import some libraries that will be used throughout this notebook.

In [28]:
print("Starting the importing of the libraries...")


import pandas as pd
import numpy as np

# Here we first need to install the plotly library.
!pip install plotly
import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import sys
import os
import re

!pip install joblib
import joblib
from joblib import Parallel, delayed


print("Finishing the installing of the libraries.")

Starting the importing of the libraries...
Finishing the installing of the libraries.


Now that all the libraries have been imported, we can verify that these libraries have been loaded into this notebook by calling the version property of the library.

In [29]:
# Retrieving the version of the libraries to verify they have been correctly loaded into this notebook.
print("The library 'pd' (pandas) has been loaded into the notebook with its version being:")
print(pd.__version__)

print("The library 'np' (numpy) has been loaded into the notebook with its version being:")
print(np.__version__)

print("\nThe library 're' has been loaded into the notebook with its version being:")
print(re.__version__)

print("\nThe library 'sys' has been loaded into the notebook with its version being:")
print(sys.version)

print("\nThe library 'plotly' has been loaded into the notebook with its version being:")
print(plotly.__version__)

The library 'pd' (pandas) has been loaded into the notebook with its version being:
1.2.4
The library 'np' (numpy) has been loaded into the notebook with its version being:
1.20.1

The library 're' has been loaded into the notebook with its version being:
2.2.1

The library 'sys' has been loaded into the notebook with its version being:
3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]

The library 'plotly' has been loaded into the notebook with its version being:
5.13.1


### Defining the data directories

In addition, we also need to define our data directories from which the files will be loaded and to which the resulting files will be stored. Please mind that these need to be changed to the desired directories to be able to work with the data directories.

In [30]:
data_directory_location_files = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/location_data"
data_directory_final_datasets = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets"
data_directory_final_datasets_CpG = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/CpG Site Analysis"

## Loading the Location Data

Within this section, the files are loaded into this notebook which contain the location data of the genes and the location data of the CpG sites.

### Loading the 'genes_location_data.csv' file

The 'genes_location_data.csv' file is loaded into this notebook by calling the function 'pd.read_csv()' with the path of the file to be read as its parameter.

In [31]:
# Loading the file 'genes_location_data.csv'.
genes_location_data = pd.read_csv(data_directory_location_files + '/genes_location_data.csv')

print("The 'genes_location_data' DataFrame containing the location data of the genes:")
genes_location_data

The 'genes_location_data' DataFrame containing the location data of the genes:


Unnamed: 0,gene,chromosome,start_position,end_position,strand
0,ENSG00000227232,chr1,14404,29570.0,-
1,ENSG00000278267,chr1,17369,17436.0,-
2,ENSG00000268903,chr1,135141,135895.0,-
3,ENSG00000269981,chr1,137682,137965.0,-
4,ENSG00000279457,chr1,185217,195411.0,-
...,...,...,...,...,...
19626,ENSG00000008735,chr22,50600793,50613981.0,+
19627,ENSG00000100299,chr22,50622754,50628173.0,-
19628,ENSG00000251322,chr22,50674415,50733212.0,+
19629,ENSG00000079974,chr22,50767501,50783667.0,-


### Loading the 'CpG_sites_location_data.csv' file

The 'CpG_sites_location_data.csv' file is loaded into this notebook by calling the function 'pd.read_csv()' with the path of the file to be read as its parameter.

In [32]:
# Loading the file 'CpG_sites_location_data.csv'.
CpG_sites_location_data = pd.read_csv(data_directory_location_files + '/CpG_sites_location_data.csv')

print("The 'CpG_sites_location_data' DataFrame containing the location data of the CpG sites:")
CpG_sites_location_data

The 'CpG_sites_location_data' DataFrame containing the location data of the CpG sites:


Unnamed: 0,CpG_site,chromosome,position,strand
0,cg13869341,chr1,15865,+
1,cg14008030,chr1,18827,+
2,cg20826792,chr1,29425,+
3,cg20253340,chr1,68849,+
4,cg21870274,chr1,69591,-
...,...,...,...,...
270873,cg12121634,chr22,51194096,-
270874,cg04757410,chr22,51195437,-
270875,cg09456760,chr22,51206645,+
270876,cg07660283,chr22,51223343,+


## Loading the M-transformed Methylation Data & Log2-transformed Gene Expression Data Files

Within this section, the file 'methylation_data_M_transformed_final.csv' and the file 'gene_expression_data_log2_transformed_final.csv' from the directory 'data_directory_final_datasets' are loaded into this notebook by calling the function 'pd.read_csv()' with the path of the file to be read as its parameter.

#### Loading the 'methylation_data_M_transformed_final.csv' file into this notebook

In [33]:
# Loading the file 'methylation_data_M_transformed_final.csv'.
methylation_data_M_transformed = pd.read_csv(data_directory_final_datasets + '/methylation_data_M_transformed_final.csv')

print("The 'methylation_data_M_transformed' DataFrame:")
methylation_data_M_transformed

The 'methylation_data_M_transformed' DataFrame:


Unnamed: 0,Samples,cg00000957,cg00001349,cg00001583,cg00002837,cg00003287,cg00004121,cg00008647,cg00009292,cg00011717,...,ch.22.28920330F,ch.22.436090R,ch.22.441164F,ch.22.528917R,ch.22.569473R,ch.22.707049R,ch.22.728807R,ch.22.734399R,ch.22.772318F,ch.22.909671F
0,TCGA-06-0125-01A-01,3.132755,3.960518,-5.452737,2.348422,0.642046,0.425173,-3.568310,0.090094,4.677447,...,-4.328807,-1.604700,-5.391144,-4.299671,-3.299155,-5.006189,-4.483695,-1.512707,-4.935511,-4.526937
1,TCGA-06-0125-02A-11,3.196057,3.825019,-5.503606,1.372434,0.849407,0.629880,-3.292764,1.242929,5.700119,...,-2.854379,-1.720475,-5.169584,-4.430285,-3.100187,-4.953307,-3.404266,-1.763778,-4.931599,-3.668569
2,TCGA-06-0152-02A-01,4.057813,3.626717,1.146710,-0.208610,0.059986,0.788350,-0.941862,1.416831,5.731835,...,-4.614915,-2.221432,-5.487836,-4.947837,-2.324764,-4.632719,-4.005875,-1.983516,-4.787276,-3.809136
3,TCGA-06-0171-02A-11,4.139295,3.058785,-1.109405,-0.179801,-2.005682,0.634832,-4.202176,-0.933237,6.317184,...,-3.329587,-3.007868,-5.728251,-3.993713,-3.442227,-4.986055,-3.436608,-2.831527,-5.334004,-4.291577
4,TCGA-06-0190-01A-01,3.179215,3.408476,-4.613496,-0.233366,-0.672902,0.370326,-1.536587,0.712060,4.594485,...,-3.925038,-2.666686,-5.455511,-4.393648,-3.584784,-5.394690,-3.741805,-2.292808,-4.840999,-3.959180
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,3.427987,4.101014,-0.268118,-0.005975,-0.769480,0.898056,-5.020576,1.241059,5.889282,...,-4.408250,-1.532935,-3.325032,-2.215352,-3.453181,-3.190405,-3.143200,-0.768262,-2.643066,-1.345063
60,TCGA-76-4928-01B-01,2.091628,3.824918,-5.990499,0.344366,-0.717767,0.425782,-5.487022,0.370565,5.714939,...,-5.370823,-1.044793,-4.456032,-2.967873,-3.687673,-3.546754,-3.881002,-1.597628,-4.209691,-2.319970
61,TCGA-76-4929-01A-01,3.166884,3.437609,-6.054600,1.700659,-4.722319,0.514961,-5.351739,1.562834,3.495553,...,-4.206406,-1.537983,-3.914480,-2.712960,-3.349122,-2.192265,-2.538937,-1.356304,-2.419188,-1.598524
62,TCGA-76-4931-01A-01,2.464759,3.631037,3.468339,0.585791,0.666281,0.879107,-5.656808,3.381448,4.041543,...,-4.784318,-1.500444,-4.241542,-4.235881,-3.584725,-2.905236,-3.847968,-0.846516,-4.245142,-1.847410


#### Loading the 'gene_expression_data_log2_transformed_final.csv' file into this notebook

In [34]:
# Loading the file 'gene_expression_data_log2_transformed_final.csv'.
gene_expression_data_log2_transformed = pd.read_csv(data_directory_final_datasets + '/gene_expression_data_log2_transformed_final.csv')

print("The 'gene_expression_data_log2_transformed' DataFrame:")
gene_expression_data_log2_transformed

The 'gene_expression_data_log2_transformed' DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,6.621171,2.915100,2.804239,3.011692,4.164312,6.016307,3.955192,4.659828,0.985136,...,3.341360,4.047233,2.695816,1.547549,3.044482,2.704385,0.767655,0.000000,3.779134,2.368740
1,TCGA-06-0125-02A-11,6.010155,2.698863,2.048550,4.123418,4.123277,6.087189,4.578244,4.262696,1.269931,...,2.808200,3.227371,2.401084,1.382944,1.933988,1.640852,0.700617,1.291662,4.009482,1.681854
2,TCGA-06-0152-02A-01,6.631346,2.883797,2.764750,4.523010,4.343927,6.211419,3.390117,4.676335,1.245252,...,3.344246,3.576861,2.495260,1.481764,2.833011,2.196733,1.697285,0.000000,3.700562,2.107219
3,TCGA-06-0171-02A-11,5.820404,2.595169,1.913148,6.059275,5.694766,6.063341,4.459025,4.545616,2.499221,...,2.170181,2.286674,1.960734,0.931078,1.921132,2.126444,0.766722,0.000000,3.113250,0.603597
4,TCGA-06-0190-01A-01,6.351695,2.551934,2.991481,4.729645,4.695788,6.674928,4.204962,4.891721,2.695816,...,2.444137,3.047172,2.077277,0.928048,2.468505,1.875387,0.229834,0.000000,2.827270,2.018029
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,6.032791,2.697929,2.486508,3.600924,3.809723,6.338483,4.089176,4.714811,3.187578,...,3.681371,3.211012,1.889084,1.171783,2.123335,1.187134,1.310689,0.000000,3.179065,2.604119
60,TCGA-76-4928-01B-01,6.496410,2.381450,2.285521,4.086308,2.995231,6.425281,3.453439,4.351275,2.183042,...,3.023078,2.557386,1.857344,0.440952,2.366308,1.622790,0.516923,0.000000,2.983185,1.395556
61,TCGA-76-4929-01A-01,6.521268,3.111098,1.758175,4.619372,3.469834,4.941332,4.311561,5.589269,2.704983,...,3.427901,2.907641,2.535580,2.031395,2.443501,1.956837,2.229649,0.000000,3.455663,3.203295
62,TCGA-76-4931-01A-01,6.414766,2.932118,2.799709,3.526820,3.366028,4.879471,4.093070,4.822628,2.103296,...,3.399718,3.740863,3.189129,1.569199,3.316349,3.349351,0.928579,0.000000,3.875239,2.824299


## Selecting Set of Predictable Genes

Within this section, we retrieve the set of genes in which every gene has at least a few (2) CpG sites correlated to it, meaning CpG sites whose absolute correlation with the gene is at least a certain threshold. Since only a single distance is used for the CpG Site Analysis part, we choose here the distance that will also be used for the remainder of that part. This ensures that the property of every gene having at least 2 correlated CpG sites within the distance still holds after the filtering in this notebook. From the notebooks 'Analyzing Linear Regression Results for Distance Analysis.ipynb', 'Analyzing Lasso Regression Results for Distance Analysis.ipynb', 'Analyzing Ridge Regression Results for Distance Analysis.ipynb', and 'Analyzing Elastic Net Regression Results for Distance Analysis.ipynb' we can conclude that a distance of 250,000,000 in general performs the best (of course also realizing that there may be a separate best-performing distance for each of the four machine learning algorithms applied for the Distance Analysis part). This distance will be used for the remainder of the CpG Site Analysis part and thus also for the experiments performed within this notebook. The next step is thus to decide what the correlation threshold is.
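To make the role of the distance concrete, the following sketch selects the CpG sites that lie on the same chromosome as a gene and within a given distance of its start and end positions. The DataFrame `toy_cpg_locations` and the function `cpg_sites_near_gene` are hypothetical names used only for this example; the layout merely mirrors that of the 'CpG_sites_location_data' DataFrame.

```python
import pandas as pd

# Hypothetical location data in the same layout as 'CpG_sites_location_data'.
toy_cpg_locations = pd.DataFrame({
    "CpG_site": ["cg_a", "cg_b", "cg_c", "cg_d"],
    "chromosome": ["chr1", "chr1", "chr1", "chr2"],
    "position": [500, 1500, 9000, 1200],
})


def cpg_sites_near_gene(cpg_locations, chromosome, start_position, end_position, distance):
    """Return the CpG sites on 'chromosome' located between start - distance and end + distance."""
    on_chromosome = cpg_locations[cpg_locations["chromosome"] == chromosome]
    # 'between()' includes both boundaries, so sites inside the gene body are covered as well.
    in_window = on_chromosome["position"].between(start_position - distance,
                                                  end_position + distance)
    return on_chromosome.loc[in_window, "CpG_site"].tolist()


# A hypothetical gene on chr1 spanning positions 1000 to 2000.
nearby_small_distance = cpg_sites_near_gene(toy_cpg_locations, "chr1", 1000, 2000, distance=600)
nearby_large_distance = cpg_sites_near_gene(toy_cpg_locations, "chr1", 1000, 2000, distance=250000000)

print(nearby_small_distance)
print(nearby_large_distance)
```

With the small distance only the two sites close to the gene are selected, whereas with a distance of 250,000,000 every site on the gene's chromosome is selected. The same should hold for the real data, since 250,000,000 is larger than the length of the longest human chromosome (roughly 249 million base pairs for chromosome 1).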

In [54]:
# Setting the distance to be equal to 250,000,000.
distance = 250000000

As we can see from the plots above, not many genes have multiple CpG sites within a distance of 250,000,000 in both directions with a high Pearson's correlation coefficient. Although this might seem to contradict the argument that methylation affects gene expression, it makes sense considering that DNA methylation does affect gene expression levels but is just one of several epigenetic mechanisms that cells use to control gene expression [3]. Because of this, I have decided to set the correlation threshold to 0.3: this still indicates a positive or negative correlation, although a low one [4], without assuming that DNA methylation is the only mechanism that affects the gene expression levels within cells.

The next step is to find the genes for which at most 1 CpG site has an absolute correlation coefficient with the gene that is higher than or equal to the threshold of 0.3. This can be achieved by looping over every gene present in the 'genes_location_data' DataFrame (which are the same genes present in the 'gene_expression_data_log2_transformed' DataFrame) and counting, for each gene, how many CpG sites have an absolute correlation coefficient that is higher than or equal to the threshold. If this number is lower than 2, the gene is marked for removal. The marked genes are then later removed from the gene expression data.
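The counting step for a single gene can be sketched on toy data as follows. The arrays are hypothetical (6 samples, one gene, three CpG sites); the sketch only illustrates how 'np.corrcoef()' stacks the gene and the CpG sites into one correlation matrix and how the first row is compared against the threshold.

```python
import numpy as np

threshold = 0.3

# Hypothetical toy data: one gene and three CpG sites measured in 6 samples.
gene_expression = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).reshape(1, -1)
cpg_methylation = np.array([
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],   # perfectly positively correlated with the gene
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],     # perfectly negatively correlated with the gene
    [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],   # no linear relationship with the gene
])

# 'np.corrcoef()' treats every row as a variable and returns a 4x4 matrix here:
# row 0 is the gene, rows 1 to 3 are the CpG sites.
correlation_matrix = np.corrcoef(gene_expression, cpg_methylation)

# Row 0 without its first (diagonal) element holds the gene-to-CpG correlations.
correlation_coefficients = np.abs(correlation_matrix[0, 1:])

number_of_correlated_sites = int(np.sum(correlation_coefficients >= threshold))
to_remove = bool(number_of_correlated_sites < 2)

print(correlation_coefficients, number_of_correlated_sites, to_remove)
```

In this toy example two of the three CpG sites reach the threshold, so the gene would be kept.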

In [None]:
# Defining the correlation threshold.
threshold = 0.30

# Transposing the 'methylation_data_M_transformed' DataFrame as this allows for faster indexing as we can then index 
# the rows instead of the columns by calling the property 'T'.
methylation_data_transposed = methylation_data_M_transformed.T

# Using the first row of the transposed DataFrame as the column labels and then dropping that row by respectively 
# calling the functions 'set_axis()' and 'drop()'.
methylation_data_transposed = methylation_data_transposed.set_axis(methylation_data_transposed.iloc[0], axis=1)
methylation_data_transposed = methylation_data_transposed.drop(methylation_data_transposed.index[0])

# This function retrieves whether the 'gene' should be removed based on its correlation coefficient with the CpG sites.
def removing_genes_with_low_correlation(gene):
    
    # Checking whether the gene is of the correct format and the correct length. A raw string is used for the pattern
    # so that '\d' is passed to the 're' library unchanged.
    if re.match(r'^ENSG00000\d{6}$', gene) and (len(gene) == 15):
        # Retrieving the chromosome, start position and end position of the gene.
        chromosome = (genes_location_data[genes_location_data['gene'] == gene])['chromosome'].iloc[0]
        start_position = (genes_location_data[genes_location_data['gene'] == gene])['start_position'].iloc[0]
        end_position = (genes_location_data[genes_location_data['gene'] == gene])['end_position'].iloc[0]
        
        # Retrieving the CpG sites from the 'CpG_sites_location_data' DataFrame which have as their chromosome the 
        # 'chromosome' variable. 
        CpG_sites_from_chromosome = CpG_sites_location_data[CpG_sites_location_data["chromosome"] == chromosome]
        
        # Retrieving the CpG sites in the 'CpG_sites_from_chromosome' DataFrame of which the positions lie within the 
        # 'distance' from the 'start_position' and 'end_position' of the gene. Of course, the CpG sites that are located
        # in between these two positions are included as well.
        CpG_sites_within_distance_and_between_positions = CpG_sites_from_chromosome[((CpG_sites_from_chromosome["position"] >= start_position - distance) & 
                                            (CpG_sites_from_chromosome["position"] <= end_position + distance)) |
                                            ((CpG_sites_from_chromosome["position"] > start_position) & 
                                            (CpG_sites_from_chromosome["position"] < end_position))]['CpG_site'].tolist()
        
        # Retrieving a subset of the rows of the 'methylation_data_transposed' DataFrame by checking whether the CpG sites
        # appear within the 'CpG_sites_within_distance_and_between_positions' DataFrame. This can be achieved by calling the 
        # function 'isin()'.
        CpG_sites_within_distance_and_between_positions_from_methylation = methylation_data_transposed[
                                            methylation_data_transposed.index.isin(CpG_sites_within_distance_and_between_positions)]
        
        # Retrieving the data of the current 'gene' in the form of a list by calling the property 'values' and the function 
        # 'tolist()'. This data is reshaped by calling the function 'reshape()'.
        gene_data = np.array(gene_expression_data_log2_transformed[gene].values.tolist()).reshape(1,-1)

        # Retrieving the data of all the CpG sites present within the DataFrame 
        # 'CpG_sites_within_distance_and_between_positions_from_methylation' by calling the property 'values' and converting
        # it into a float array by calling the function 'astype(float)'.
        CpG_data = np.array(CpG_sites_within_distance_and_between_positions_from_methylation.values).astype(float)
        
        # Retrieving the correlation coefficients of the 'gene' with every CpG site present in the 'CpG_data' array by 
        # calling the function 'np.corrcoef()' from the 'numpy' library (the absolute values are retrieved by calling the 
        # function 'np.abs()'). Since a correlation matrix is returned, we retrieve its first row (the row of the 'gene') 
        # and omit the first value, which is the diagonal element.
        correlation_coefficients = np.abs(np.corrcoef(gene_data, CpG_data)[0, 1:])
        
        # If there are fewer than 2 correlation coefficients within 'correlation_coefficients' that are higher than or 
        # equal to the 'threshold', then the variable 'to_remove' is set to True.
        if (np.sum(correlation_coefficients >= threshold)) < 2:
            to_remove = True
        else: 
            to_remove = False
            
    else:
        sys.exit("The gene is not of the correct format and is not present in the gene expression data.")
            
    return {'gene': gene, 'remove': to_remove}

# Retrieving the genes present in the 'genes_location_data' DataFrame.
genes = genes_location_data['gene']
    
# Defining a list featuring all the genes to remove as they have at most 1 CpG site that has a higher or equal correlation
# coefficient with the gene than the threshold of 0.30. This can be achieved by calling the function
# 'removing_genes_with_low_correlation()' for each of the genes. Since the computations for a single gene do not influence 
# the computations of any other gene, we can parallelize the execution of this function by calling the function 'Parallel()' 
# from the 'joblib' library.
genes_to_remove = Parallel(n_jobs=8)(delayed(removing_genes_with_low_correlation)(gene) for gene in genes)

# Defining the dictionary which will store which genes should be removed.
dictionary_genes_to_remove = {gen['gene']: gen['remove'] for gen in genes_to_remove}

# Creating the DataFrame by calling the constructor 'DataFrame()' which takes as input a dictionary featuring the names 
# of all the genes and, for each of the genes, whether it should be removed.
genes_to_remove_df = pd.DataFrame({'Gene': list(dictionary_genes_to_remove.keys()),
                                   'genes_to_remove': list(dictionary_genes_to_remove.values())})

As we can see from the output above, 0 genes are marked for removal. This means that none of the 19,631 genes has at most 1 CpG site within a distance of 250,000,000 whose absolute correlation coefficient with the gene is higher than or equal to the threshold of 0.3. This makes the removal process very easy, as we can store the 'gene_expression_data_log2_transformed' DataFrame unchanged in the directory 'data_directory_final_datasets_CpG'.
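Had any genes been marked, they would have to be dropped from the gene expression data before storing it. A minimal sketch of that step on toy data is shown below, assuming a flag table with the same column names as 'genes_to_remove_df'. The DataFrames `toy_expression` and `toy_flags` and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical gene expression data: one 'Samples' column and one column per gene.
toy_expression = pd.DataFrame({
    "Samples": ["sample_1", "sample_2"],
    "GENE_A": [1.0, 2.0],
    "GENE_B": [3.0, 4.0],
    "GENE_C": [5.0, 6.0],
})

# Hypothetical flag table with the same column names as 'genes_to_remove_df'.
toy_flags = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_B", "GENE_C"],
    "genes_to_remove": [False, True, False],
})

# Collecting the names of the marked genes and counting them.
flagged_genes = toy_flags.loc[toy_flags["genes_to_remove"], "Gene"].tolist()
number_of_flagged_genes = len(flagged_genes)

# Dropping the columns of the marked genes; with an empty list the DataFrame stays unchanged.
toy_expression_filtered = toy_expression.drop(columns=flagged_genes)

print(number_of_flagged_genes, list(toy_expression_filtered.columns))
```

When no gene is marked, `flagged_genes` is empty and the drop leaves the data unchanged, which corresponds to the situation in this notebook.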

## Storing the Resulting Gene Expression Data

#### Storing the resulting 'gene_expression_data_log2_transformed' DataFrame

Now that we have completed removing the genes from our original 'gene_expression_data_log2_transformed' DataFrame we can store its result to the local directory 'data_directory_final_datasets_CpG'.

In [55]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_final_datasets_CpG + "/gene_expression_data_log2_transformed_correlation_genes_removed.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    gene_expression_data_log2_transformed.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

The file with the path C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/CpG Site Analysis/gene_expression_data_log2_transformed_correlation_genes_removed.csv has been created.
