# Analyzing the CpG Sites Importance Scores
### Laurence Nickel (i6257119)

Libraries used: 
* pandas (version: '1.2.4')
* re (version: '2.2.1')
* sys (version: '3.8.8')
* os (version: '3.8.8')
* plotly.express (version: '5.13.1')
* seaborn (version: '0.11.1')
* matplotlib.pyplot (version: '3.3.4')

References:
* [1] Frumkin, I., Lajoie, M. J., Gregg, C., Hornung, G., Church, G. M., & Pilpel, Y. (2018). Codon usage of highly expressed genes affects proteome-wide translation efficiency. *Proceedings of the National Academy of Sciences of the United States of America, 115*(21), E4940-E4949. doi: https://doi.org/10.1073/pnas.1719375115.

## Introduction

Within this notebook, the importance scores retrieved for the CpG sites are analyzed. These importance scores were computed from the coefficients of the models with positive R<sup>2</sup> scores, and the coefficients were retrieved from four different machine learning algorithms:
* Linear Regression
* Lasso Regression
* Ridge Regression
* Elastic Net Regression

The importance scores of the CpG sites were computed from the coefficients in two different ways:
* Approach 1: Summing the coefficient values of each CpG site over all models and dividing this sum by the number of models in which the CpG site is used. This approach is biased towards highly expressed genes, because the coefficients of the CpG sites for those genes will be higher. These highly expressed genes, however, are considered to be quite important, as these kinds of genes are most likely to have an active role in biological processes [1].
* Approach 2: Normalizing the coefficients of each model so that all the coefficients of a single model add up to 1. This prevents the coefficients of the CpG sites for highly expressed genes from automatically resulting in a higher importance score. To retrieve the importance score, the normalized coefficient values of each CpG site are summed and divided by the number of models in which the CpG site is used.
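
The two approaches can be sketched on a toy example. The block below is a minimal illustration, not the code that produced the files analyzed in this notebook: the model names, CpG site names, and coefficient values are hypothetical, and it assumes that the coefficients are non-negative (for example, absolute values) so that the normalization in approach 2 is well defined.

```python
# Hypothetical toy input: for each model, the coefficient of every CpG site it uses.
# 'model_1' stands in for a highly expressed gene with large coefficients.
toy_coefficients = {
    "model_1": {"cg_a": 3.0, "cg_b": 1.0},
    "model_2": {"cg_b": 0.1, "cg_c": 0.4},
}


def importance_scores(coefficients_per_model, normalize=False):
    """Return {CpG site: importance score}.

    normalize=False corresponds to approach 1, normalize=True to approach 2.
    """
    summation = {}
    number_of_models = {}
    for coefficients in coefficients_per_model.values():
        # Approach 2: rescale so that the coefficients of this model add up to 1.
        total = sum(coefficients.values()) if normalize else 1.0
        for CpG_site, coefficient in coefficients.items():
            value = coefficient / total if total else 0.0
            summation[CpG_site] = summation.get(CpG_site, 0.0) + value
            number_of_models[CpG_site] = number_of_models.get(CpG_site, 0) + 1
    # Dividing each summation by the number of models the CpG site is used in.
    return {site: summation[site] / number_of_models[site] for site in summation}


scores_approach1 = importance_scores(toy_coefficients)
scores_approach2 = importance_scores(toy_coefficients, normalize=True)

print("Approach 1:", scores_approach1)
print("Approach 2:", scores_approach2)
```

In this toy example, approach 1 ranks 'cg_a' first because 'model_1' has large coefficients, whereas approach 2 ranks 'cg_c' first once every model contributes the same total weight.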

Both approaches are analyzed to investigate whether the way the importance score is calculated changes the ranking of the important CpG sites.

### Importing libraries

Before we can start the analysis, we first import the libraries that are used throughout this notebook.

In [1]:
print("Starting the importing of the libraries...")


import pandas as pd
import re
import sys
import os

# Here we first need to install the plotly library.
!pip install plotly
import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


print("Finishing the installing of the libraries.")

Starting the importing of the libraries...
Finishing the installing of the libraries.


Now that all the libraries have been imported, we can verify that these libraries have been loaded into this notebook by calling the version property of the library.

In [2]:
# Retrieving the version of the libraries to verify they have been correctly loaded into this notebook.
print("The library 'pd' (pandas) has been loaded into the notebook with its version being:")
print(pd.__version__)

print("\nThe library 're' has been loaded into the notebook with its version being:")
print(re.__version__)

print("\nThe library 'sys' has been loaded into the notebook with its version being:")
print(sys.version)

print("\nThe library 'plotly' has been loaded into the notebook with its version being:")
print(plotly.__version__)

print("\nThe library 'sns' (seaborn) has been loaded into the notebook with its version being:")
print(sns.__version__)

print("\nThe library 'matplotlib' has been loaded into the notebook with its version being:")
print(matplotlib.__version__)

The library 'pd' (pandas) has been loaded into the notebook with its version being:
1.2.4

The library 're' has been loaded into the notebook with its version being:
2.2.1

The library 'sys' has been loaded into the notebook with its version being:
3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]

The library 'plotly' has been loaded into the notebook with its version being:
5.13.1

The library 'sns' (seaborn) has been loaded into the notebook with its version being:
0.11.1

The library 'matplotlib' has been loaded into the notebook with its version being:
3.3.4
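
The version checks above can also be written as a loop. The sketch below assumes that a missing `__version__` attribute should be reported as `None` rather than raise an error; the helper name `module_version` is hypothetical. Standard library modules such as `os` do not define `__version__`, so for those the Python version from `sys.version` is the relevant one.

```python
import importlib


def module_version(module_name):
    """Return the '__version__' attribute of a module, or None if it has none."""
    module = importlib.import_module(module_name)
    return getattr(module, "__version__", None)


# Only standard library modules are listed here so that the sketch runs anywhere;
# third-party names such as 'pandas' or 'plotly' can be added in the same way.
for name in ["re", "os"]:
    print(name, "->", module_version(name))
```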


### Defining the data directories

In addition, we need to define the data directories from which the CpG importance scores files will be loaded. Please note that these paths must be changed to the directories where the files are stored on your own machine.

In [3]:
data_directory_results_CpG_linear = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/results/CpG Site Analysis/Linear Regression"
data_directory_results_CpG_lasso = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/results/CpG Site Analysis/Lasso Regression"
data_directory_results_CpG_ridge = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/results/CpG Site Analysis/Ridge Regression"
data_directory_results_CpG_elastic_net = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/results/CpG Site Analysis/Elastic Net Regression"
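
Because the four directories share a common parent, they can also be derived from a single base path, so that only one line needs changing on another machine. The sketch below is one possible layout; `base_directory`, `algorithm_folders`, `data_directories`, and the placeholder path are assumptions and not part of the original notebook.

```python
from pathlib import Path

# Placeholder: change this to the folder that contains the four result folders.
base_directory = Path("path/to/results/CpG Site Analysis")

# Folder names as used in the directory definitions above.
algorithm_folders = {
    "linear": "Linear Regression",
    "lasso": "Lasso Regression",
    "ridge": "Ridge Regression",
    "elastic_net": "Elastic Net Regression",
}

data_directories = {
    algorithm: base_directory / folder
    for algorithm, folder in algorithm_folders.items()
}

# Reporting missing directories early makes path mistakes easier to spot.
for algorithm, directory in data_directories.items():
    if not directory.is_dir():
        print(f"Directory for '{algorithm}' not found: {directory}")
```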

## Loading CpG Importance Scores Files

Within this section, we load the CpG importance scores files from the directories 'data_directory_results_CpG_linear', 'data_directory_results_CpG_lasso', 'data_directory_results_CpG_ridge', and 'data_directory_results_CpG_elastic_net' into this notebook by calling the function 'pd.read_csv()' with the path of the file to be read as its parameter.
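
The `Unnamed: 0` column that shows up in the DataFrames below is typically what pandas produces when a CSV file was written together with its index and then read back without `index_col`. The sketch below illustrates this on a small in-memory CSV with made-up values; whether the files in this notebook were written that way is an assumption.

```python
import io

import pandas as pd

# Made-up CSV content in the same layout as the importance score files:
# the first, unnamed column holds the index that was saved with the file.
csv_text = (
    ",CpG Site,Coefficient Summation,Number of Models,Importance Score\n"
    "0,cg_a,0.6,2.0,0.3\n"
    "1,cg_b,0.1,1.0,0.1\n"
)

# Default behaviour: the saved index becomes a column called 'Unnamed: 0'.
with_unnamed = pd.read_csv(io.StringIO(csv_text))

# Passing 'index_col=0' uses that first column as the index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_unnamed.columns))
print(list(with_index.columns))
```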

#### Loading the CpG importance scores files for approach 1

In [4]:
# Loading the file 'CpG_importance_scores_approach1_sorted.csv'.
CpG_importance_scores_approach1_linear = pd.read_csv(data_directory_results_CpG_linear + '/CpG_importance_scores_approach1_sorted.csv')

print("The 'CpG_importance_scores_approach1_linear' DataFrame:")
CpG_importance_scores_approach1_linear

The 'CpG_importance_scores_approach1_linear' DataFrame:


| Unnamed: 0 | CpG Site | Coefficient Summation | Number of Models | Importance Score |
|---|---|---|---|---|
| 0 | cg03701992 | 0.330313 | 1.0 | 0.330313 |
| 1 | cg00106744 | 0.746499 | 3.0 | 0.248833 |
| 2 | cg05946920 | 0.450255 | 2.0 | 0.225128 |
| 3 | ch.21.818333F | 1.091802 | 5.0 | 0.218360 |
| 4 | cg20678665 | 0.215275 | 1.0 | 0.215275 |
| ... | ... | ... | ... | ... |
| 270873 | cg13136556 | 0.000000 | 0.0 | 0.000000 |
| 270874 | cg22469122 | 0.000000 | 0.0 | 0.000000 |
| 270875 | cg20648398 | 0.000000 | 0.0 | 0.000000 |
| 270876 | cg13142612 | 0.000000 | 0.0 | 0.000000 |


In [5]:
# Loading the file 'CpG_importance_scores_approach1_sorted.csv'.
CpG_importance_scores_approach1_lasso = pd.read_csv(data_directory_results_CpG_lasso + '/CpG_importance_scores_approach1_sorted.csv')

print("The 'CpG_importance_scores_approach1_lasso' DataFrame:")
CpG_importance_scores_approach1_lasso

The 'CpG_importance_scores_approach1_lasso' DataFrame:


| Unnamed: 0 | CpG Site | Coefficient Summation | Number of Models | Importance Score |
|---|---|---|---|---|
| 0 | cg26941801 | 0.880357 | 3.0 | 0.293452 |
| 1 | cg23081781 | 1.040507 | 4.0 | 0.260127 |
| 2 | cg25453430 | 2.028791 | 11.0 | 0.184436 |
| 3 | cg25952615 | 0.496514 | 3.0 | 0.165505 |
| 4 | cg01074001 | 0.144059 | 1.0 | 0.144059 |
| ... | ... | ... | ... | ... |
| 270873 | cg18220168 | 0.000000 | 387.0 | 0.000000 |
| 270874 | cg08057037 | 0.000000 | 80.0 | 0.000000 |
| 270875 | cg08051049 | 0.000000 | 192.0 | 0.000000 |
| 270876 | cg08048948 | 0.000000 | 35.0 | 0.000000 |


In [6]:
# Loading the file 'CpG_importance_scores_approach1_sorted.csv'.
CpG_importance_scores_approach1_ridge = pd.read_csv(data_directory_results_CpG_ridge + '/CpG_importance_scores_approach1_sorted.csv')

print("The 'CpG_importance_scores_approach1_ridge' DataFrame:")
CpG_importance_scores_approach1_ridge

The 'CpG_importance_scores_approach1_ridge' DataFrame:


| Unnamed: 0 | CpG Site | Coefficient Summation | Number of Models | Importance Score |
|---|---|---|---|---|
| 0 | cg25453430 | 1.769624 | 7.0 | 0.252803 |
| 1 | cg23081781 | 0.831556 | 4.0 | 0.207889 |
| 2 | cg00106744 | 0.487061 | 3.0 | 0.162354 |
| 3 | cg03701992 | 0.160495 | 1.0 | 0.160495 |
| 4 | cg04960798 | 0.786360 | 5.0 | 0.157272 |
| ... | ... | ... | ... | ... |
| 270873 | cg15127832 | 0.000000 | 0.0 | 0.000000 |
| 270874 | cg24786671 | 0.000000 | 0.0 | 0.000000 |
| 270875 | cg23361356 | 0.000000 | 0.0 | 0.000000 |
| 270876 | cg15269875 | 0.000000 | 0.0 | 0.000000 |


In [7]:
# Loading the file 'CpG_importance_scores_approach1_sorted.csv'.
CpG_importance_scores_approach1_elastic_net = pd.read_csv(data_directory_results_CpG_elastic_net + '/CpG_importance_scores_approach1_sorted.csv')

print("The 'CpG_importance_scores_approach1_elastic_net' DataFrame:")
CpG_importance_scores_approach1_elastic_net

The 'CpG_importance_scores_approach1_elastic_net' DataFrame:


| Unnamed: 0 | CpG Site | Coefficient Summation | Number of Models | Importance Score |
|---|---|---|---|---|
| 0 | cg23081781 | 0.692151 | 4.0 | 0.173038 |
| 1 | cg26941801 | 0.645541 | 4.0 | 0.161385 |
| 2 | cg10050487 | 0.125320 | 1.0 | 0.125320 |
| 3 | cg25952615 | 0.480114 | 4.0 | 0.120029 |
| 4 | cg19469449 | 0.204914 | 2.0 | 0.102457 |
| ... | ... | ... | ... | ... |
| 270873 | cg01666436 | 0.000000 | 34.0 | 0.000000 |
| 270874 | cg01901022 | 0.000000 | 11.0 | 0.000000 |
| 270875 | cg01929377 | 0.000000 | 2.0 | 0.000000 |
| 270876 | cg01934527 | 0.000000 | 17.0 | 0.000000 |


#### Loading the CpG importance scores files for approach 2

In [8]:
# Loading the file 'CpG_importance_scores_approach2_sorted.csv'.
CpG_importance_scores_approach2_linear = pd.read_csv(data_directory_results_CpG_linear + '/CpG_importance_scores_approach2_sorted.csv')

print("The 'CpG_importance_scores_approach2_linear' DataFrame:")
CpG_importance_scores_approach2_linear

The 'CpG_importance_scores_approach2_linear' DataFrame:


| Unnamed: 0 | CpG Site | Coefficient Summation | Number of Models | Importance Score |
|---|---|---|---|---|
| 0 | cg03701992 | 0.088435 | 1.0 | 0.088435 |
| 1 | cg00106744 | 0.228844 | 3.0 | 0.076281 |
| 2 | ch.21.818333F | 0.347567 | 5.0 | 0.069513 |
| 3 | cg25453430 | 0.337729 | 5.0 | 0.067546 |
| 4 | cg17900854 | 0.215574 | 4.0 | 0.053893 |
| ... | ... | ... | ... | ... |
| 270873 | cg11926610 | 0.000000 | 0.0 | 0.000000 |
| 270874 | cg01223193 | 0.000000 | 0.0 | 0.000000 |
| 270875 | cg07306881 | 0.000000 | 0.0 | 0.000000 |
| 270876 | cg02169859 | 0.000000 | 0.0 | 0.000000 |


In [9]:
# Loading the file 'CpG_importance_scores_approach2_sorted.csv'.
CpG_importance_scores_approach2_lasso = pd.read_csv(data_directory_results_CpG_lasso + '/CpG_importance_scores_approach2_sorted.csv')

print("The 'CpG_importance_scores_approach2_lasso' DataFrame:")
CpG_importance_scores_approach2_lasso

The 'CpG_importance_scores_approach2_lasso' DataFrame:


| Unnamed: 0 | CpG Site | Coefficient Summation | Number of Models | Importance Score |
|---|---|---|---|---|
| 0 | cg25453430 | 2.282115 | 11.0 | 0.207465 |
| 1 | cg26941801 | 0.549171 | 3.0 | 0.183057 |
| 2 | cg13732083 | 1.118293 | 8.0 | 0.139787 |
| 3 | cg19924120 | 0.384832 | 3.0 | 0.128277 |
| 4 | cg00092711 | 0.254835 | 2.0 | 0.127417 |
| ... | ... | ... | ... | ... |
| 270873 | cg22667358 | 0.000000 | 71.0 | 0.000000 |
| 270874 | cg01871526 | 0.000000 | 560.0 | 0.000000 |
| 270875 | cg22664798 | 0.000000 | 86.0 | 0.000000 |
| 270876 | cg22658238 | 0.000000 | 15.0 | 0.000000 |


In [10]:
# Loading the file 'CpG_importance_scores_approach2_sorted.csv'.
CpG_importance_scores_approach2_ridge = pd.read_csv(data_directory_results_CpG_ridge + '/CpG_importance_scores_approach2_sorted.csv')

print("The 'CpG_importance_scores_approach2_ridge' DataFrame:")
CpG_importance_scores_approach2_ridge

The 'CpG_importance_scores_approach2_ridge' DataFrame:


| Unnamed: 0 | CpG Site | Coefficient Summation | Number of Models | Importance Score |
|---|---|---|---|---|
| 0 | cg04097543 | 0.103790 | 1.0 | 0.103790 |
| 1 | cg25453430 | 0.632244 | 7.0 | 0.090321 |
| 2 | cg03701992 | 0.080318 | 1.0 | 0.080318 |
| 3 | cg00106744 | 0.174029 | 3.0 | 0.058010 |
| 4 | cg04112169 | 0.330503 | 6.0 | 0.055084 |
| ... | ... | ... | ... | ... |
| 270873 | cg26060583 | 0.000000 | 0.0 | 0.000000 |
| 270874 | cg07678583 | 0.000000 | 0.0 | 0.000000 |
| 270875 | cg23166740 | 0.000000 | 0.0 | 0.000000 |
| 270876 | cg07557491 | 0.000000 | 0.0 | 0.000000 |


In [11]:
# Loading the file 'CpG_importance_scores_approach2_sorted.csv'.
CpG_importance_scores_approach2_elastic_net = pd.read_csv(data_directory_results_CpG_elastic_net + '/CpG_importance_scores_approach2_sorted.csv')

print("The 'CpG_importance_scores_approach2_elastic_net' DataFrame:")
CpG_importance_scores_approach2_elastic_net

The 'CpG_importance_scores_approach2_elastic_net' DataFrame:


| Unnamed: 0 | CpG Site | Coefficient Summation | Number of Models | Importance Score |
|---|---|---|---|---|
| 0 | cg25453430 | 1.228964 | 11.0 | 0.111724 |
| 1 | cg10050487 | 0.110369 | 1.0 | 0.110369 |
| 2 | cg26941801 | 0.356198 | 4.0 | 0.089050 |
| 3 | cg04097543 | 0.177206 | 2.0 | 0.088603 |
| 4 | cg26958236 | 0.433555 | 5.0 | 0.086711 |
| ... | ... | ... | ... | ... |
| 270873 | cg21752292 | 0.000000 | 88.0 | 0.000000 |
| 270874 | cg21752340 | 0.000000 | 73.0 | 0.000000 |
| 270875 | cg21754400 | 0.000000 | 216.0 | 0.000000 |
| 270876 | cg21757048 | 0.000000 | 67.0 | 0.000000 |


## Analyzing the CpG Sites Importance Scores

Within this section, the importance scores of the CpG sites present in the four DataFrames above for approach 1 and the four DataFrames above for approach 2 are analyzed.

To identify the most important CpG sites, that is, the key CpG sites in DNA methylation that affect gene expression in brain cancer, we count how many times each CpG site appears among the 10 highest importance scores across the four DataFrames of a single approach. This is done independently for each of the two approaches.
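
The counting step described above can be sketched with `collections.Counter` from the standard library. The top lists below are hypothetical and shortened to three entries each; the cells that follow perform the same count with a plain dictionary on the real top 10 lists.

```python
from collections import Counter

# Hypothetical top lists, one per algorithm (shortened for readability).
toy_top_lists = [
    ["cg_a", "cg_b", "cg_c"],
    ["cg_a", "cg_c", "cg_d"],
    ["cg_a", "cg_b", "cg_e"],
    ["cg_a", "cg_c", "cg_f"],
]

# Counting in how many of the lists each CpG site appears.
toy_counts = Counter(site for top_list in toy_top_lists for site in top_list)

# 'most_common()' returns (CpG site, count) pairs sorted by descending count.
for CpG_site, count in toy_counts.most_common():
    print(CpG_site, count)
```

A site with a count of 4 appears in the top list of every algorithm, which is the pattern the analysis below looks for.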

#### Analyzing the top 10 highest importance scores across the four different DataFrames for approach 1

In [12]:
# Creating a list storing the CpG sites present in the top 10 CpG sites with the highest importance score for all the four
# DataFrames belonging to approach 1.
top_10_lists_approach1 = [
    CpG_importance_scores_approach1_linear[:10]['CpG Site'].to_list(),
    CpG_importance_scores_approach1_lasso[:10]['CpG Site'].to_list(),
    CpG_importance_scores_approach1_ridge[:10]['CpG Site'].to_list(),
    CpG_importance_scores_approach1_elastic_net[:10]['CpG Site'].to_list(),
]

# Creating a new dictionary that will store for each of the CpG sites how often they appear in the 'top_10_lists_approach1'.
CpG_site_counts_approach1 = {}

# Looping over every list present in the 'top_10_lists_approach1' and counting in how many of the lists each CpG site appears.
for CpG_site_list in top_10_lists_approach1:
    for CpG_site in CpG_site_list:
        if CpG_site in CpG_site_counts_approach1:
            CpG_site_counts_approach1[CpG_site] += 1
        else:
            CpG_site_counts_approach1[CpG_site] = 1

# Sorting the CpG sites present in the 'CpG_site_counts_approach1' dictionary based on their values, which can be achieved by
# calling the functions 'sorted()' and 'items()' and making use of a lambda expression.
CpG_site_counts_approach1_sorted = dict(sorted(CpG_site_counts_approach1.items(), key=lambda x: x[1], reverse=True))

print("The 'CpG_site_counts_approach1_sorted' dictionary sorted in descending order:")
CpG_site_counts_approach1_sorted

The 'CpG_site_counts_approach1_sorted' dictionary sorted in descending order:


{'cg23081781': 4,
 'cg25453430': 4,
 'cg25952615': 3,
 'cg03701992': 2,
 'cg00106744': 2,
 'ch.21.818333F': 2,
 'cg26941801': 2,
 'cg05946920': 1,
 'cg20678665': 1,
 'cg17900854': 1,
 'cg01300495': 1,
 'cg17937570': 1,
 'cg01074001': 1,
 'cg08599229': 1,
 'cg02216951': 1,
 'cg24308082': 1,
 'cg01546248': 1,
 'cg25554205': 1,
 'cg04960798': 1,
 'cg16809914': 1,
 'cg15176664': 1,
 'cg04112169': 1,
 'cg10050487': 1,
 'cg19469449': 1,
 'cg20220242': 1,
 'cg17005068': 1,
 'cg26958236': 1,
 'cg21200923': 1}

#### Analyzing the top 10 highest importance scores across the four different DataFrames for approach 2

In [13]:
# Creating a list storing the CpG sites present in the top 10 CpG sites with the highest importance score for all the four
# DataFrames belonging to approach 2.
top_10_lists_approach2 = [
    CpG_importance_scores_approach2_linear[:10]['CpG Site'].to_list(),
    CpG_importance_scores_approach2_lasso[:10]['CpG Site'].to_list(),
    CpG_importance_scores_approach2_ridge[:10]['CpG Site'].to_list(),
    CpG_importance_scores_approach2_elastic_net[:10]['CpG Site'].to_list(),
]

# Creating a new dictionary that will store for each of the CpG sites how often they appear in the 'top_10_lists_approach2'.
CpG_site_counts_approach2 = {}

# Looping over every list present in the 'top_10_lists_approach2' and counting in how many of the lists each CpG site appears.
for CpG_site_list in top_10_lists_approach2:
    for CpG_site in CpG_site_list:
        if CpG_site in CpG_site_counts_approach2:
            CpG_site_counts_approach2[CpG_site] += 1
        else:
            CpG_site_counts_approach2[CpG_site] = 1

# Sorting the CpG sites present in the 'CpG_site_counts_approach2' dictionary based on their values, which can be achieved by
# calling the functions 'sorted()' and 'items()' and making use of a lambda expression.
CpG_site_counts_approach2_sorted = dict(sorted(CpG_site_counts_approach2.items(), key=lambda x: x[1], reverse=True))

print("The 'CpG_site_counts_approach2_sorted' dictionary sorted in descending order:")
CpG_site_counts_approach2_sorted

The 'CpG_site_counts_approach2_sorted' dictionary sorted in descending order:


{'cg25453430': 4,
 'cg26958236': 3,
 'cg04097543': 3,
 'cg03701992': 2,
 'cg00106744': 2,
 'ch.21.818333F': 2,
 'cg10019329': 2,
 'cg25385366': 2,
 'cg26941801': 2,
 'cg19924120': 2,
 'cg00092711': 2,
 'cg19835478': 2,
 'cg04112169': 2,
 'cg17900854': 1,
 'cg20678665': 1,
 'cg10364942': 1,
 'cg13732083': 1,
 'cg10156586': 1,
 'cg03787899': 1,
 'cg03988107': 1,
 'cg10050487': 1,
 'cg10636297': 1,
 'cg24725931': 1}

Several things can be noted from the outputs above. For the frequency counts of approach 1, the CpG sites 'cg23081781' and 'cg25453430' appear in the top 10 highest importance scores of all four DataFrames. This indicates that they can be considered key CpG sites in DNA methylation that affect gene expression in brain cancer, as their importance scores are high regardless of which machine learning algorithm is applied. Since approach 1 is biased towards highly expressed genes, the consistently important CpG sites identified through this approach are likely to have a significant effect on those highly expressed genes. These genes are in turn considered to be quite important, as these kinds of genes are most likely to have an active role in biological processes [1].

For the frequency counts of approach 2, the CpG site 'cg25453430' appears in the top 10 highest importance scores of all four DataFrames, and the CpG sites 'cg26958236' and 'cg04097543' appear in the top 10 of three of the four DataFrames. This again indicates that they can be considered key CpG sites in DNA methylation that affect gene expression in brain cancer, as their importance scores are high for most or all of the machine learning algorithms. Since the coefficients were first normalized in approach 2, these CpG sites maintained their importance even after normalization, which suggests that they are consistently important across the models.

The outputs above also show that the choice of approach can influence, at least to some extent, which CpG sites are identified as key. Some CpG sites are important and have high frequency counts for both approaches, such as CpG site 'cg25453430', which appears in the top 10 highest importance scores of all the DataFrames across the two approaches. To investigate this further, the code below retrieves which CpG sites appear in both approaches and which appear in only one of them.

In [14]:
# Retrieving which CpG sites appear in both approaches by calling the function 'set()' and sorting them based on the total
# frequency counts for the CpG sites across the two approaches by calling the function 'sorted()'.
common_CpG_sites = set(CpG_site_counts_approach1_sorted.keys()) & set(CpG_site_counts_approach2_sorted.keys())
common_CpG_sites_sorted = sorted(common_CpG_sites, key=lambda CpG: CpG_site_counts_approach1_sorted[CpG] + CpG_site_counts_approach2_sorted[CpG], reverse=True)

# Retrieving which CpG sites appear only in approach 1 by taking the set difference and sorting them based on their
# frequency counts in approach 1 by calling the function 'sorted()'.
CpG_sites_only_approach1 = set(CpG_site_counts_approach1_sorted.keys()) - set(CpG_site_counts_approach2_sorted.keys())
CpG_sites_only_approach1_sorted = sorted(CpG_sites_only_approach1, key=lambda CpG: CpG_site_counts_approach1_sorted[CpG], reverse=True)

# Retrieving which CpG sites appear only in approach 2 by taking the set difference and sorting them based on their
# frequency counts in approach 2 by calling the function 'sorted()'.
CpG_sites_only_approach2 = set(CpG_site_counts_approach2_sorted.keys()) - set(CpG_site_counts_approach1_sorted.keys())
CpG_sites_only_approach2_sorted = sorted(CpG_sites_only_approach2, key=lambda CpG: CpG_site_counts_approach2_sorted[CpG], reverse=True)

print("Common CpG sites sorted by total frequency counts:")
for CpG in common_CpG_sites_sorted:
    print(CpG, 'count in Approach 1:', CpG_site_counts_approach1_sorted[CpG], 'and count in Approach 2:', CpG_site_counts_approach2_sorted[CpG])

print("\nCpG sites unique to Approach 1:")
for CpG in CpG_sites_only_approach1_sorted:
    print(CpG, 'count in Approach 1:', CpG_site_counts_approach1_sorted[CpG])

print("\nCpG sites unique to Approach 2:")
for CpG in CpG_sites_only_approach2_sorted:
    print(CpG, 'count in Approach 2:', CpG_site_counts_approach2_sorted[CpG])

Common CpG sites sorted by total frequency counts:
cg25453430 count in Approach 1: 4 and count in Approach 2: 4
cg00106744 count in Approach 1: 2 and count in Approach 2: 2
cg03701992 count in Approach 1: 2 and count in Approach 2: 2
cg26941801 count in Approach 1: 2 and count in Approach 2: 2
cg26958236 count in Approach 1: 1 and count in Approach 2: 3
ch.21.818333F count in Approach 1: 2 and count in Approach 2: 2
cg04112169 count in Approach 1: 1 and count in Approach 2: 2
cg17900854 count in Approach 1: 1 and count in Approach 2: 1
cg10050487 count in Approach 1: 1 and count in Approach 2: 1
cg20678665 count in Approach 1: 1 and count in Approach 2: 1

CpG sites unique to Approach 1:
cg23081781 count in Approach 1: 4
cg25952615 count in Approach 1: 3
cg17937570 count in Approach 1: 1
cg21200923 count in Approach 1: 1
cg15176664 count in Approach 1: 1
cg19469449 count in Approach 1: 1
cg01300495 count in Approach 1: 1
cg25554205 count in Approach 1: 1
cg01546248 count in Approach 1:

As the output above shows, there are CpG sites that appear in the top 10 highest importance scores of three or four DataFrames in one approach but do not appear in any of the top 10 lists of the other approach, such as 'cg23081781' and 'cg25952615' for approach 1. This suggests that the choice of approach can influence which CpG sites are identified as key. It also underlines the importance of, for example, the CpG site 'cg25453430', as it appears in the top 10 highest importance scores of all the DataFrames across the two approaches.
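
One way to summarize how much the two approaches agree is the Jaccard index of their sets of top-ranked CpG sites: the size of the intersection divided by the size of the union. This measure is not used in the analysis above; the sketch below is an optional addition, and the function name and site names in it are hypothetical.

```python
def jaccard_index(first, second):
    """Return the intersection size divided by the union size of two collections."""
    first, second = set(first), set(second)
    union = first | second
    if not union:
        # Two empty collections share nothing, so the overlap is reported as 0.
        return 0.0
    return len(first & second) / len(union)


# Hypothetical sets of top-ranked CpG sites for the two approaches.
toy_sites_approach1 = {"cg_a", "cg_b", "cg_c", "cg_d"}
toy_sites_approach2 = {"cg_a", "cg_b", "cg_e"}

overlap = jaccard_index(toy_sites_approach1, toy_sites_approach2)
print(f"Jaccard index: {overlap:.2f}")
```

Applied to the keys of 'CpG_site_counts_approach1_sorted' and 'CpG_site_counts_approach2_sorted', a value close to 1 would mean that both approaches select nearly the same CpG sites, and a value close to 0 would mean that they select mostly different ones.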