# Machine Learning Additional Functions
### Laurence Nickel (i6257119)

Libraries used: 
* pandas (version: '1.2.4')
* numpy (version: '1.20.1')
* sys (standard library, Python version: '3.8.8')
* os (standard library, Python version: '3.8.8')
* re (version: '2.2.1')

## Introduction

This notebook declares additional functions that can be called from the notebooks which perform the actual machine learning techniques. Running the command '%run "Machine Learning Additional Functions.ipynb"' in a machine learning notebook loads the functions into that notebook, after which they can be called freely. The section below gives a summary of all the functions and how they work. The methylation and gene expression data files are not loaded into this notebook; they are loaded into the machine learning notebooks themselves and supplied to the functions as parameters. The files containing the location data of the CpG sites and the genes, as well as the lengths of the chromosomes, are loaded into this notebook.

### The Functions

To present a clear view of the functions that are featured below, the following overview is provided.
* The __'get_chromosome_and_location_from_gene(gene)'__ function: retrieves the chromosome and the exact location on that chromosome of the passed on gene.
* The __'get_methylation_data_from_chromosome(methylation_data, chromosome)'__ function: retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are present on the passed on 'chromosome'.
* The __'get_methylation_data_close_to_gene(methylation_data, gene, distance)'__ function: retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are within a distance of 'distance' from the 'gene'.
* The __'get_methylation_data_close_to_gene_and_with_higher_correlation_than_threshold(methylation_data, gene_expression_data_current_gene, distance, threshold)'__ function: retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are within a distance of 'distance' from the gene and for which each of the CpG sites has an absolute correlation coefficient with the gene that is higher than or equal to the 'threshold'.
* The __'get_methylation_data_close_to_chromosome_position(methylation_data, chromosome, position, distance)'__ function: retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are within a distance of 'distance' from the 'position' on the 'chromosome'.
* The __'get_gene_expression_data_from_chromosome(gene_expression_data, chromosome)'__ function: retrieves a subset of the passed on 'gene_expression_data' DataFrame where the genes are present on the passed on 'chromosome'.
* The __'get_gene_expression_data_with_2_CpG_sites_within_distance(gene_expression_data, methylation_data, distance)'__ function: retrieves a subset of the passed on 'gene_expression_data' DataFrame where the genes have at least 2 CpG sites present in the 'methylation_data' DataFrame located within a distance of 'distance' from the current gene in both directions.
* The __'get_gene_expression_data_with_2_CpG_sites_correlated_higher_than_threshold(gene_expression_data, methylation_data, threshold)'__ function: retrieves a subset of the passed on 'gene_expression_data' DataFrame where the genes have at least 2 CpG sites present in the 'methylation_data' DataFrame which have a correlation with the current gene that is higher than or equal to the 'threshold'.

### Importing libraries

Before we can start to define all the functions, we should first import some libraries that will be used throughout this notebook.

In [1]:
print("Starting the importing of the libraries...")


import pandas as pd
import numpy as np

import sys
import os
import re


print("Finishing the importing of the libraries.")

Starting the importing of the libraries...
Finishing the importing of the libraries.


Now that all the libraries have been imported, we can verify that they have been loaded into this notebook by printing the version of each library.

In [2]:
# Retrieving the version of the libraries to verify they have been correctly loaded into this notebook.
print("The library 'pd' (pandas) has been loaded into the notebook with its version being:")
print(pd.__version__)

print("The library 'np' (numpy) has been loaded into the notebook with its version being:")
print(np.__version__)

print("\nThe library 're' has been loaded into the notebook with its version being:")
print(re.__version__)

print("\nThe library 'sys' has been loaded into the notebook with its version being:")
print(sys.version)

The library 'pd' (pandas) has been loaded into the notebook with its version being:
1.2.4
The library 'np' (numpy) has been loaded into the notebook with its version being:
1.20.1

The library 're' has been loaded into the notebook with its version being:
2.2.1

The library 'sys' has been loaded into the notebook with its version being:
3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]


### Defining the data directory

In addition, we need to define the data directory from which the location files will be loaded. Please note that this path must be changed to the directory that holds the location files on your own machine.

In [3]:
data_directory_location_files = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/location_data"
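
The file paths further below are built by concatenating the directory and the file name. As a minimal sketch, the same paths could be built with 'os.path.join()' from the already imported 'os' library. The directory "location_data" and the helper name 'location_file_path' are assumptions made for this illustration and are not part of the original notebook.

```python
import os

# Hypothetical directory; replace it with the folder that holds the location files.
data_directory_location_files = "location_data"


def location_file_path(file_name, directory=data_directory_location_files):
    """Join the data directory and a file name into a single path."""
    return os.path.join(directory, file_name)


# Building the path to one of the location files.
cpg_sites_path = location_file_path("CpG_sites_location_data.csv")
print(cpg_sites_path)
```

Using 'os.path.join()' leaves the choice of path separator to the operating system, which can help when the notebook is run on a different machine.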

## Loading the location data

Within this section, the files containing the location data of the CpG sites and the genes, as well as the lengths of the chromosomes, are loaded into this notebook.

### Loading the 'CpG_sites_location_data.csv' file

The 'CpG_sites_location_data.csv' file is loaded into this notebook by calling the function 'pd.read_csv()' with the path of the file as its parameter.

In [4]:
# Loading the file 'CpG_sites_location_data.csv'.
CpG_sites_location_data = pd.read_csv(data_directory_location_files + '/CpG_sites_location_data.csv')

print("The 'CpG_sites_location_data' DataFrame containing the location data of the CpG sites:")
CpG_sites_location_data

The 'CpG_sites_location_data' DataFrame containing the location data of the CpG sites:


Unnamed: 0,CpG_site,chromosome,position,strand
0,cg13869341,chr1,15865,+
1,cg14008030,chr1,18827,+
2,cg20826792,chr1,29425,+
3,cg20253340,chr1,68849,+
4,cg21870274,chr1,69591,-
...,...,...,...,...
270873,cg12121634,chr22,51194096,-
270874,cg04757410,chr22,51195437,-
270875,cg09456760,chr22,51206645,+
270876,cg07660283,chr22,51223343,+
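
The 'Unnamed: 0' column in the preview above suggests that the CSV file stores a row index as its first, unnamed column. As a hedged sketch, passing 'index_col=0' to 'pd.read_csv()' would read that column as the index instead of as data. The few rows below are copied from the preview, and the exact header of the real file is an assumption.

```python
import io

import pandas as pd

# A few rows in the same layout as the preview above, with an unnamed first column.
csv_text = """,CpG_site,chromosome,position,strand
0,cg13869341,chr1,15865,+
1,cg14008030,chr1,18827,+
2,cg20826792,chr1,29425,+
"""

# 'index_col=0' tells pandas to use the first column as the row index.
CpG_sites_preview = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(CpG_sites_preview.columns.tolist())
```

None of the functions in this notebook reads the 'Unnamed: 0' column, so dropping it this way should not affect them.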


### Loading the 'genes_location_data.csv' file

The 'genes_location_data.csv' file is loaded into this notebook by calling the function 'pd.read_csv()' with the path of the file as its parameter.

In [5]:
# Loading the file 'genes_location_data.csv'.
genes_location_data = pd.read_csv(data_directory_location_files + '/genes_location_data.csv')

print("The 'genes_location_data' DataFrame containing the location data of the genes:")
genes_location_data

The 'genes_location_data' DataFrame containing the location data of the genes:


Unnamed: 0,gene,chromosome,start_position,end_position,strand
0,ENSG00000227232,chr1,14404,29570.0,-
1,ENSG00000278267,chr1,17369,17436.0,-
2,ENSG00000268903,chr1,135141,135895.0,-
3,ENSG00000269981,chr1,137682,137965.0,-
4,ENSG00000279457,chr1,185217,195411.0,-
...,...,...,...,...,...
19626,ENSG00000008735,chr22,50600793,50613981.0,+
19627,ENSG00000100299,chr22,50622754,50628173.0,-
19628,ENSG00000251322,chr22,50674415,50733212.0,+
19629,ENSG00000079974,chr22,50767501,50783667.0,-


### Loading the 'chromosomes_length_data.csv' file

The 'chromosomes_length_data.csv' file is loaded into this notebook by calling the function 'pd.read_csv()' with the path of the file as its parameter.

In [6]:
# Loading the file 'chromosomes_length_data.csv'.
chromosomes_length_data = pd.read_csv(data_directory_location_files + '/chromosomes_length_data.csv')

print("The 'chromosomes_length_data' DataFrame containing the lengths of the chromosomes:")
chromosomes_length_data

The 'chromosomes_length_data' DataFrame containing the lengths of the chromosomes:


Unnamed: 0,chromosome,length
0,chr1,249250621
1,chr2,243199373
2,chr3,198022430
3,chr4,191154276
4,chr5,180915260
5,chr6,171115067
6,chr7,159138663
7,chr8,146364022
8,chr9,141213431
9,chr10,135534747


## The Functions

Within this section, the functions are defined which can then be called from the appropriate notebooks which actually perform the machine learning techniques.

### The <i>'get_chromosome_and_location_from_gene(gene)'</i> function

The function 'get_chromosome_and_location_from_gene(gene)' retrieves the chromosome and the exact location on that chromosome of the passed on gene.

In [7]:
# This function retrieves the chromosome and the exact location on that chromosome of the passed on gene.
def get_chromosome_and_location_from_gene(gene):
    
    # Checking whether the gene is of the correct format and the correct length. A raw string is used for the 
    # pattern such that '\d' is not treated as an (invalid) string escape sequence.
    if not (re.match(r'^ENSG00000\d{6}$', gene) and (len(gene) == 15)):
        sys.exit("The gene is not of the correct format.")
    
    # Retrieving the row of the gene once instead of filtering the 'genes_location_data' DataFrame three times.
    gene_location = genes_location_data[genes_location_data['gene'] == gene]
    
    # Checking whether the gene is present in the location data, as 'iloc[0]' would otherwise raise an IndexError.
    if gene_location.empty:
        sys.exit("The gene is not present in the 'genes_location_data' DataFrame.")
    
    # Retrieving the chromosome, start position and end position of the gene.
    chromosome = gene_location['chromosome'].iloc[0]
    start_position = gene_location['start_position'].iloc[0]
    end_position = gene_location['end_position'].iloc[0]
    
    return {"chromosome": chromosome, "start_position": start_position, "end_position": end_position}
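
The function 'sys.exit()' raises a 'SystemExit' exception, which is awkward for a calling notebook to recover from. As a hedged alternative, the sketch below performs the same lookup but raises a 'ValueError' or a 'KeyError' instead. The name 'lookup_gene_location' and the toy DataFrame (two rows copied from the preview above) are assumptions for illustration only.

```python
import re

import pandas as pd

# Toy stand-in for the 'genes_location_data' DataFrame.
genes_location_data = pd.DataFrame({
    "gene": ["ENSG00000227232", "ENSG00000278267"],
    "chromosome": ["chr1", "chr1"],
    "start_position": [14404, 17369],
    "end_position": [29570.0, 17436.0],
})

# The same format check as used in this notebook, written as a raw string.
GENE_PATTERN = re.compile(r"^ENSG00000\d{6}$")


def lookup_gene_location(gene, locations):
    """Return the chromosome, start position and end position of a gene."""
    if not isinstance(gene, str) or len(gene) != 15 or not GENE_PATTERN.match(gene):
        raise ValueError(f"'{gene}' is not of the correct format.")

    rows = locations[locations["gene"] == gene]
    if rows.empty:
        raise KeyError(f"'{gene}' is not present in the location data.")

    first = rows.iloc[0]
    return {
        "chromosome": first["chromosome"],
        "start_position": first["start_position"],
        "end_position": first["end_position"],
    }


# A caller can now decide how to handle a gene that cannot be looked up.
try:
    location = lookup_gene_location("ENSG00000227232", genes_location_data)
    print(location["chromosome"], location["start_position"], location["end_position"])
except (ValueError, KeyError) as error:
    print("Skipping gene:", error)
```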

### The <i>'get_methylation_data_from_chromosome(methylation_data, chromosome)'</i> function

The function 'get_methylation_data_from_chromosome(methylation_data, chromosome)' retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are present on the passed on 'chromosome'.

In [8]:
# This function retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are present on the 
# passed on 'chromosome'.
def get_methylation_data_from_chromosome(methylation_data, chromosome):

    # Checking whether the chromosome is a value between 1 and 22.
    pattern = r'^chr([1-9]|1[0-9]|2[0-2])$'
    if not re.match(pattern, str(chromosome)):
        if (isinstance(chromosome, int) and 1 <= chromosome <= 22) or (isinstance(chromosome, str) and chromosome.isdigit() and 1 <= int(chromosome) <= 22):
            chromosome = "chr" + str(chromosome)
        else:
            sys.exit("The chromosome given does not fall between 1 and 22.")
          
    # Retrieving the CpG sites from the 'CpG_sites_location_data' DataFrame which have as their chromosome the 'chromosome'
    # variable.
    CpG_sites_from_chromosome = CpG_sites_location_data[CpG_sites_location_data["chromosome"] == chromosome]['CpG_site'].tolist()
    
    # Transposing the 'methylation_data' DataFrame as this allows for faster indexing as we can then index the rows
    # instead of the columns by calling the property 'T'.
    methylation_data_transposed = methylation_data.T

    # Setting the first row (the sample names) to be the column labels and dropping that row by respectively calling
    # the functions 'set_axis()' and 'drop()'.
    methylation_data_transposed = methylation_data_transposed.set_axis(methylation_data_transposed.iloc[0], axis=1)
    methylation_data_transposed = methylation_data_transposed.drop(methylation_data_transposed.index[0])

    # Retrieving a subset of the rows of the 'methylation_data_transposed' DataFrame by checking whether the CpG sites
    # appear within the 'CpG_sites_from_chromosome' DataFrame. This can be achieved by calling the function 'isin()'.
    CpG_sites_from_chromosome_from_methylation = methylation_data_transposed[
                                        methylation_data_transposed.index.isin(CpG_sites_from_chromosome)]

    # Transposing the 'CpG_sites_from_chromosome_from_methylation' DataFrame such that the CpG sites are the columns and the 
    # samples the rows (as this is the input expected by most machine learning algorithms).
    methylation_data_from_chromosome = CpG_sites_from_chromosome_from_methylation.T

    # Inserting the sample names currently present as the indices as the first column by calling the function 'insert()'
    # and resetting the indices by calling the function 'reset_index()'.
    methylation_data_from_chromosome.insert(0, 'Samples', methylation_data_from_chromosome.index)
    methylation_data_from_chromosome = methylation_data_from_chromosome.reset_index(drop=True)
    
    return methylation_data_from_chromosome
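
The function above transposes the DataFrame twice to filter on the CpG sites. As a minimal sketch of an alternative, the columns could be selected directly without transposing. This sketch assumes that the first column of 'methylation_data' is named 'Samples'; the function name 'select_cpg_columns_from_chromosome' and the methylation values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the 'CpG_sites_location_data' DataFrame.
CpG_sites_location_data = pd.DataFrame({
    "CpG_site": ["cg13869341", "cg14008030", "cg12121634"],
    "chromosome": ["chr1", "chr1", "chr22"],
    "position": [15865, 18827, 51194096],
})

# Hypothetical methylation values: one row per sample, one column per CpG site.
methylation_data = pd.DataFrame({
    "Samples": ["sample_1", "sample_2"],
    "cg13869341": [0.81, 0.79],
    "cg12121634": [0.10, 0.15],
    "cg14008030": [0.42, 0.40],
})


def select_cpg_columns_from_chromosome(methylation_data, chromosome, locations):
    """Keep the 'Samples' column and the CpG site columns on the given chromosome."""
    sites = locations.loc[locations["chromosome"] == chromosome, "CpG_site"]
    cpg_columns = methylation_data.columns[1:]
    keep = cpg_columns[cpg_columns.isin(sites)]
    return methylation_data[["Samples", *keep]]


subset = select_cpg_columns_from_chromosome(methylation_data, "chr1", CpG_sites_location_data)
print(subset.columns.tolist())
```

Transposing a DataFrame that mixes the sample names with numeric values yields columns of the 'object' type, so selecting the columns directly also keeps the numeric types of the methylation values.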

### The <i>'get_methylation_data_close_to_gene(methylation_data, gene, distance=500000)'</i> function

The function 'get_methylation_data_close_to_gene(methylation_data, gene, distance=500000)' retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are within a distance of 'distance' from the 'gene'.

In [1]:
# This function retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are within a distance 
# of 'distance' from the 'gene'.
def get_methylation_data_close_to_gene(methylation_data, gene, distance=500000):

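    # Checking whether the gene is of the correct format and the correct length. A raw string is used for the 
    # pattern such that '\d' is not treated as an (invalid) string escape sequence.
    if re.match(r'^ENSG00000\d{6}$', gene) and (len(gene) == 15):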
        # Retrieving the chromosome, start position and end position of the gene.
        chromosome = (genes_location_data[genes_location_data['gene'] == gene])['chromosome'].iloc[0]
        start_position = (genes_location_data[genes_location_data['gene'] == gene])['start_position'].iloc[0]
        end_position = (genes_location_data[genes_location_data['gene'] == gene])['end_position'].iloc[0]

        # Retrieving the CpG sites from the 'CpG_sites_location_data' DataFrame which have as their chromosome the 'chromosome'
        # variable. 
        CpG_sites_from_chromosome = CpG_sites_location_data[CpG_sites_location_data["chromosome"] == chromosome]
        
        # Retrieving the CpG sites in the 'CpG_sites_from_chromosome' DataFrame of which the positions lie within the
        # 'distance' from the 'start_position' and 'end_position' of the gene. Of course, the CpG sites that are located
        # in between these two positions are included as well.
        CpG_sites_within_distance_and_between_positions = CpG_sites_from_chromosome[((CpG_sites_from_chromosome["position"] >= start_position - distance) & 
                                            (CpG_sites_from_chromosome["position"] <= end_position + distance)) |
                                            ((CpG_sites_from_chromosome["position"] > start_position) & 
                                            (CpG_sites_from_chromosome["position"] < end_position))]['CpG_site'].tolist()
        
        # Transposing the 'methylation_data' DataFrame as this allows for faster indexing as we can then index the rows
        # instead of the columns by calling the property 'T'.
        methylation_data_transposed = methylation_data.T
        
        # Setting the first row (the sample names) to be the column labels and dropping that row by respectively calling
        # the functions 'set_axis()' and 'drop()'.
        methylation_data_transposed = methylation_data_transposed.set_axis(methylation_data_transposed.iloc[0], axis=1)
        methylation_data_transposed = methylation_data_transposed.drop(methylation_data_transposed.index[0])
        
        # Retrieving a subset of the rows of the 'methylation_data_transposed' DataFrame by checking whether the CpG sites
        # appear within the 'CpG_sites_within_distance_and_between_positions' DataFrame. This can be achieved by calling the 
        # function 'isin()'.
        CpG_sites_within_distance_and_between_positions_from_methylation = methylation_data_transposed[
                                            methylation_data_transposed.index.isin(CpG_sites_within_distance_and_between_positions)]

        # Transposing the 'CpG_sites_within_distance_and_between_positions_from_methylation' DataFrame such that the CpG
        # sites are the columns and the samples the rows (as this is the input expected by most machine learning algorithms).
        methylation_data_close_to_gene = CpG_sites_within_distance_and_between_positions_from_methylation.T
        
        # Inserting the sample names currently present as the indices as the first column by calling the function 'insert()'
        # and resetting the indices by calling the function 'reset_index()'.
        methylation_data_close_to_gene.insert(0, 'Samples', methylation_data_close_to_gene.index)
        methylation_data_close_to_gene = methylation_data_close_to_gene.reset_index(drop=True)

    else:
        sys.exit("The gene is not of the correct format and is not present in the gene expression data.")
        
    return methylation_data_close_to_gene
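
The window used above runs from 'start_position - distance' up to and including 'end_position + distance'. As a small sketch on made-up positions, the same selection can be written with the pandas function 'Series.between()', which is inclusive on both sides by default. The CpG site names and the positions are hypothetical.

```python
import pandas as pd

# Hypothetical CpG sites on a single chromosome.
CpG_sites_from_chromosome = pd.DataFrame({
    "CpG_site": ["cg_a", "cg_b", "cg_c", "cg_d"],
    "position": [500, 1500, 2500, 4000],
})
start_position, end_position, distance = 1000, 2000, 600

# The window runs from 400 up to and including 2600.
position = CpG_sites_from_chromosome["position"]
in_window = position.between(start_position - distance, end_position + distance)
nearby_sites = CpG_sites_from_chromosome.loc[in_window, "CpG_site"].tolist()

# The two-part condition used in this notebook, for comparison.
original_mask = (((position >= start_position - distance) & (position <= end_position + distance)) |
                 ((position > start_position) & (position < end_position)))

print(nearby_sites)
print(in_window.equals(original_mask))
```

For a non-negative 'distance', every position between the start and the end of the gene already lies inside the widened window, so the second part of the original condition does not select any additional CpG sites.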

### The <i>'get_methylation_data_close_to_gene_and_with_higher_correlation_than_threshold(methylation_data, gene_expression_data_current_gene, distance=500000, threshold=0.3)'</i> function

The function 'get_methylation_data_close_to_gene_and_with_higher_correlation_than_threshold(methylation_data, gene_expression_data_current_gene, distance=500000, threshold=0.3)' retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are within a distance of 'distance' from the gene and for which each of the CpG sites has an absolute Pearson correlation coefficient with the gene that is higher than or equal to the 'threshold'. The gene is taken from the second column of the 'gene_expression_data_current_gene' DataFrame.

In [2]:
# This function retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are within a distance 
# of 'distance' from the gene and for which each of the CpG sites has an absolute correlation coefficient with the gene
# that is higher than or equal to the 'threshold'.
def get_methylation_data_close_to_gene_and_with_higher_correlation_than_threshold(methylation_data, gene_expression_data_current_gene, distance=500000, threshold=0.3):
    
    # Retrieving the gene of the 'gene_expression_data_current_gene' DataFrame.
    gene = gene_expression_data_current_gene.columns[1]
    
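    # Checking whether the gene is of the correct format and the correct length. A raw string is used for the 
    # pattern such that '\d' is not treated as an (invalid) string escape sequence.
    if re.match(r'^ENSG00000\d{6}$', gene) and (len(gene) == 15):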
        # Retrieving the chromosome, start position and end position of the gene.
        chromosome = (genes_location_data[genes_location_data['gene'] == gene])['chromosome'].iloc[0]
        start_position = (genes_location_data[genes_location_data['gene'] == gene])['start_position'].iloc[0]
        end_position = (genes_location_data[genes_location_data['gene'] == gene])['end_position'].iloc[0]

        # Retrieving the CpG sites from the 'CpG_sites_location_data' DataFrame which have as their chromosome the 'chromosome'
        # variable. 
        CpG_sites_from_chromosome = CpG_sites_location_data[CpG_sites_location_data["chromosome"] == chromosome]
        
        # Retrieving the CpG sites in the 'CpG_sites_from_chromosome' DataFrame of which the positions lie within the
        # 'distance' from the 'start_position' and 'end_position' of the gene. Of course, the CpG sites that are located
        # in between these two positions are included as well.
        CpG_sites_within_distance_and_between_positions = CpG_sites_from_chromosome[((CpG_sites_from_chromosome["position"] >= start_position - distance) & 
                                            (CpG_sites_from_chromosome["position"] <= end_position + distance)) |
                                            ((CpG_sites_from_chromosome["position"] > start_position) & 
                                            (CpG_sites_from_chromosome["position"] < end_position))]['CpG_site'].tolist()
        
        # Transposing the 'methylation_data' DataFrame as this allows for faster indexing as we can then index the rows
        # instead of the columns by calling the property 'T'.
        methylation_data_transposed = methylation_data.T
        
        # Setting the first row (the sample names) to be the column labels and dropping that row by respectively calling
        # the functions 'set_axis()' and 'drop()'.
        methylation_data_transposed = methylation_data_transposed.set_axis(methylation_data_transposed.iloc[0], axis=1)
        methylation_data_transposed = methylation_data_transposed.drop(methylation_data_transposed.index[0])
        
        # Retrieving a subset of the rows of the 'methylation_data_transposed' DataFrame by checking whether the CpG sites
        # appear within the 'CpG_sites_within_distance_and_between_positions' DataFrame. This can be achieved by calling the 
        # function 'isin()'.
        CpG_sites_within_distance_and_between_positions_from_methylation = methylation_data_transposed[
                                            methylation_data_transposed.index.isin(CpG_sites_within_distance_and_between_positions)]
        
        # Retrieving the data of the current 'gene' in the form of a list by calling the property 'values' and the function 
        # 'tolist()'.
        gene_data = gene_expression_data_current_gene[gene].values.tolist()

        # Creating an empty list that will later store all of the filtered rows.
        filtered_rows = []

        # Looping over every CpG site within the 'CpG_sites_within_distance_and_between_positions_from_methylation' DataFrame
        # and checking whether its correlation coefficient with the current 'gene' is higher than the 'threshold'.
        for CpG_site, row in CpG_sites_within_distance_and_between_positions_from_methylation.iterrows():
            CpG_site_data = row.values.tolist()
            
            # Calculating the Pearson's correlation coefficient between the 'gene_data' and the 'CpG_site_data' by calling
            # the function 'corrcoef()' from the 'numpy' library. Since a correlation matrix is returned, we retrieve one of
            # the non-diagonal elements (which are all the same).
            correlation_coefficient = (np.corrcoef(gene_data, CpG_site_data))[0, 1]
            
            # If this 'correlation_coefficient' (its absolute value retrieved by calling the function 'np.abs()') is higher 
            # than or equal to the 'threshold', we keep it.
            if np.abs(correlation_coefficient) >= threshold:
                filtered_rows.append(row)
        
        # Converting the 'filtered_rows' list to a new DataFrame by calling the constructor 'DataFrame()'.
        CpG_sites_within_distance_and_between_positions_from_methylation_filtered = pd.DataFrame(filtered_rows)
                
        # Transposing the 'CpG_sites_within_distance_and_between_positions_from_methylation_filtered' DataFrame such that 
        # the CpG sites are the columns and the samples the rows (as this is the input expected by most machine learning 
        # algorithms).
        methylation_data_close_to_gene = CpG_sites_within_distance_and_between_positions_from_methylation_filtered.T
        
        # Inserting the sample names currently present as the indices as the first column by calling the function 'insert()'
        # and resetting the indices by calling the function 'reset_index()'.
        methylation_data_close_to_gene.insert(0, 'Samples', methylation_data_close_to_gene.index)
        methylation_data_close_to_gene = methylation_data_close_to_gene.reset_index(drop=True)

    else:
        sys.exit("The gene is not of the correct format and is not present in the gene expression data.")
        
    return methylation_data_close_to_gene
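
The loop above computes one Pearson correlation coefficient per CpG site. As a hedged sketch of a vectorised alternative, the pandas function 'DataFrame.corrwith()' computes the correlation of every column with a Series in a single call. The values below are made up, and both objects are assumed to share the same row index (one row per sample) and to hold numeric values.

```python
import numpy as np
import pandas as pd

# Hypothetical expression values of one gene, one value per sample.
gene_expression = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Hypothetical methylation values, one column per CpG site.
methylation = pd.DataFrame({
    "cg_pos": [0.1, 0.2, 0.3, 0.4, 0.5],   # rises with the gene expression
    "cg_neg": [0.9, 0.7, 0.5, 0.3, 0.1],   # falls with the gene expression
    "cg_weak": [0.5, 0.1, 0.4, 0.1, 0.5],  # no linear relation
})
threshold = 0.3

# Pearson's correlation coefficient of every CpG site with the gene.
correlations = methylation.corrwith(gene_expression)

# Keeping the CpG sites whose absolute correlation reaches the threshold.
selected_sites = correlations.index[correlations.abs() >= threshold].tolist()
filtered_methylation = methylation[selected_sites]

print(selected_sites)
```

Because the absolute value is compared with the threshold, strongly negatively correlated CpG sites are kept as well, which matches the use of 'np.abs()' in the function above.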

### The <i>'get_methylation_data_close_to_chromosome_position(methylation_data, chromosome, position, distance=500000)'</i> function

The function 'get_methylation_data_close_to_chromosome_position(methylation_data, chromosome, position, distance=500000)' retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are within a distance of 'distance' from the 'position' on the 'chromosome'.

In [10]:
# This function retrieves a subset of the passed on 'methylation_data' DataFrame where the CpG sites are within a distance 
# of 'distance' from the 'position' on the 'chromosome'.
def get_methylation_data_close_to_chromosome_position(methylation_data, chromosome, position, distance=500000):
    
    # Checking whether the chromosome is a value between 1 and 22.
    pattern = r'^chr([1-9]|1[0-9]|2[0-2])$'
    if not re.match(pattern, str(chromosome)):
        if (isinstance(chromosome, int) and 1 <= chromosome <= 22) or (isinstance(chromosome, str) and chromosome.isdigit() and 1 <= int(chromosome) <= 22):
            chromosome = "chr" + str(chromosome)
        else:
            sys.exit("The chromosome given does not fall between 1 and 22.")
    
    # Retrieving the length of the chromosome.
    chromosome_length = chromosomes_length_data.loc[chromosomes_length_data["chromosome"] == chromosome, "length"].values[0]
    
    # Checking whether the position is a value between 1 and the length of the chromosome.
    if (isinstance(position, int) and 1 <= position <= chromosome_length) or (isinstance(position, str) and position.isdigit() and 1 <= int(position) <= chromosome_length):
        position = int(position)
    else:
        sys.exit("The position given does not fall between 1 and the length of the chromosome.")
    
    # Retrieving the CpG sites from the 'CpG_sites_location_data' DataFrame which have as their chromosome the 'chromosome'
    # variable. 
    CpG_sites_from_chromosome = CpG_sites_location_data[CpG_sites_location_data["chromosome"] == chromosome]
    
    # Retrieving the CpG sites in the 'CpG_sites_from_chromosome' DataFrame of which the positions lie within the
    # 'distance' from the 'position' in both directions.
    CpG_sites_within_distance_and_between_positions = CpG_sites_from_chromosome[((CpG_sites_from_chromosome["position"] >= position - distance) & 
                                        (CpG_sites_from_chromosome["position"] <= position + distance))]['CpG_site']
    
    # Transposing the 'methylation_data' DataFrame as this allows for faster indexing as we can then index the rows
    # instead of the columns by calling the property 'T'.
    methylation_data_transposed = methylation_data.T

    # Setting the first row (the sample names) to be the column labels and dropping that row by respectively calling
    # the functions 'set_axis()' and 'drop()'.
    methylation_data_transposed = methylation_data_transposed.set_axis(methylation_data_transposed.iloc[0], axis=1)
    methylation_data_transposed = methylation_data_transposed.drop(methylation_data_transposed.index[0])

    # Retrieving a subset of the rows of the 'methylation_data_transposed' DataFrame by checking whether the CpG sites
    # appear within the 'CpG_sites_within_distance_and_between_positions' DataFrame. This can be achieved by calling the 
    # function 'isin()'.
    CpG_sites_within_distance_and_between_positions_from_methylation = methylation_data_transposed[
                                        methylation_data_transposed.index.isin(CpG_sites_within_distance_and_between_positions)]

    # Transposing the 'CpG_sites_within_distance_and_between_positions_from_methylation' DataFrame such that the CpG
    # sites are the columns and the samples the rows (as this is the input expected by most machine learning algorithms).
    methylation_data_close_to_chromosome_position = CpG_sites_within_distance_and_between_positions_from_methylation.T

    # Inserting the sample names currently present as the indices as the first column by calling the function 'insert()'
    # and resetting the indices by calling the function 'reset_index()'.
    methylation_data_close_to_chromosome_position.insert(0, 'Samples', methylation_data_close_to_chromosome_position.index)
    methylation_data_close_to_chromosome_position = methylation_data_close_to_chromosome_position.reset_index(drop=True)
    
    return methylation_data_close_to_chromosome_position    
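
The chromosome check at the start of this function is repeated in several functions of this notebook. As a minimal sketch, it could be moved into one helper. The name 'normalize_chromosome' is hypothetical, and the helper raises a 'ValueError' instead of calling 'sys.exit()', which is a deliberate deviation from the notebook.

```python
import re

# The same pattern as used in this notebook: 'chr1' up to and including 'chr22'.
CHROMOSOME_PATTERN = re.compile(r"^chr([1-9]|1[0-9]|2[0-2])$")


def normalize_chromosome(chromosome):
    """Return the chromosome in the form 'chr<number>' or raise a ValueError."""
    text = str(chromosome)

    # Already of the form 'chr1' ... 'chr22'.
    if CHROMOSOME_PATTERN.match(text):
        return text

    # A plain number such as 7 or '7' is prefixed with 'chr'.
    if text.isdigit() and 1 <= int(text) <= 22:
        return "chr" + str(int(text))

    raise ValueError("The chromosome given does not fall between 1 and 22.")


print(normalize_chromosome(7))
print(normalize_chromosome("chr22"))
```

Unlike the original check, this sketch turns a value such as '07' into 'chr7' and rejects boolean values.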

### The <i>'get_gene_expression_data_from_chromosome(gene_expression_data, chromosome)'</i> function

The function 'get_gene_expression_data_from_chromosome(gene_expression_data, chromosome)' retrieves a subset of the passed on 'gene_expression_data' DataFrame where the genes are present on the passed on 'chromosome'.

In [11]:
# This function retrieves a subset of the passed on 'gene_expression_data' DataFrame where the genes are present on the 
# passed on 'chromosome'.
def get_gene_expression_data_from_chromosome(gene_expression_data, chromosome):
    
    # Checking whether the chromosome is a value between 1 and 22.
    pattern = r'^chr([1-9]|1[0-9]|2[0-2])$'
    if not re.match(pattern, str(chromosome)):
        if (isinstance(chromosome, int) and 1 <= chromosome <= 22) or (isinstance(chromosome, str) and chromosome.isdigit() and 1 <= int(chromosome) <= 22):
            chromosome = "chr" + str(chromosome)
        else:
            sys.exit("The chromosome given does not fall between 1 and 22.")
            
    # Retrieving the genes from the 'genes_location_data' DataFrame which have as their chromosome the 'chromosome'
    # variable.
    genes_from_chromosome = set(genes_location_data[genes_location_data["chromosome"] == chromosome]['gene'])
    genes_chromosome = [column for column in gene_expression_data.columns[1:] if column in genes_from_chromosome]
    
    # Selecting the subset of the 'gene_expression_data' DataFrame where the genes are in the 'genes_chromosome' 
    # variable.
    selected_columns = genes_chromosome
    selected_columns.insert(0, 'Samples')
    gene_expression_data_from_chromosome = gene_expression_data[selected_columns]
    
    return gene_expression_data_from_chromosome

### The <i>'get_gene_expression_data_with_2_CpG_sites_within_distance(gene_expression_data, methylation_data, distance)'</i> function

The function 'get_gene_expression_data_with_2_CpG_sites_within_distance(gene_expression_data, methylation_data, distance)' retrieves a subset of the passed on 'gene_expression_data' DataFrame where the genes have at least 2 CpG sites present in the 'methylation_data' DataFrame located within a distance of 'distance' from the current gene in both directions.

In [12]:
# This function retrieves a subset of the passed on 'gene_expression_data' DataFrame where the genes have at least 2 CpG 
# sites present in the 'methylation_data' DataFrame located within a distance of 'distance' from the current gene in both 
# directions.
def get_gene_expression_data_with_2_CpG_sites_within_distance(gene_expression_data, methylation_data, distance):
    
    # The list of genes to be removed, featuring all the genes for which at most 1 CpG site is present within a distance
    # of 'distance' in both directions from the current gene.
    genes_to_remove = []

    # Looping over every gene present in the 'gene_expression_data' DataFrame.
    for gene in gene_expression_data.columns[1:]:
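        # Checking whether the gene is of the correct format and the correct length. A raw string is used for the 
        # pattern such that '\d' is not treated as an (invalid) string escape sequence.
        if re.match(r'^ENSG00000\d{6}$', gene) and (len(gene) == 15):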
            # Retrieving the chromosome, start position and end position of the gene.
            chromosome = (genes_location_data[genes_location_data['gene'] == gene])['chromosome'].iloc[0]
            start_position = (genes_location_data[genes_location_data['gene'] == gene])['start_position'].iloc[0]
            end_position = (genes_location_data[genes_location_data['gene'] == gene])['end_position'].iloc[0]

            # Retrieving the CpG sites from the 'CpG_sites_location_data' DataFrame which have as their chromosome the 
            # 'chromosome' variable. 
            CpG_sites_from_chromosome = CpG_sites_location_data[CpG_sites_location_data["chromosome"] == chromosome]

            # Retrieving the CpG sites in the 'CpG_sites_from_chromosome' DataFrame of which the positions lie within the
            # 'distance' from the 'start_position' and 'end_position' of the gene. Of course, the CpG sites that are located
            # in between these two positions are included as well.
            CpG_sites_within_distance_and_between_positions = CpG_sites_from_chromosome[((CpG_sites_from_chromosome["position"] >= start_position - distance) & 
                                                (CpG_sites_from_chromosome["position"] <= end_position + distance)) |
                                                ((CpG_sites_from_chromosome["position"] > start_position) & 
                                                (CpG_sites_from_chromosome["position"] < end_position))]['CpG_site']
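            # Converting the selected CpG sites to a set once, so that the membership test below does not rebuild a list
            # for every column of the 'methylation_data' DataFrame.
            CpG_sites_within_distance_set = set(CpG_sites_within_distance_and_between_positions)
            CpG_sites_within_distance_and_between_positions_from_methylation = [column for column in methylation_data.columns[1:] if column in CpG_sites_within_distance_set]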

            # Retrieving the number of CpG sites present within the 'CpG_sites_within_distance_and_between_positions_from_methylation'.
            number_of_CpG_sites = len(CpG_sites_within_distance_and_between_positions_from_methylation)

            # Checking whether the 'number_of_CpG_sites' is smaller than 2. If this is the case, the gene is added to the list 
            # 'genes_to_remove'.
            if number_of_CpG_sites < 2:
                genes_to_remove.append(gene)

        else:
            sys.exit("The gene '" + str(gene) + "' in the gene expression data is not of the correct format.")
    
    
    # Selecting the subset of the 'gene_expression_data' DataFrame where the genes are not in the 'genes_to_remove' list. 
    selected_columns = [gene for gene in gene_expression_data.columns[1:] if gene not in genes_to_remove]
    selected_columns.insert(0, 'Samples')
    gene_expression_data_filtered = gene_expression_data[selected_columns]
    
    return gene_expression_data_filtered
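
The window test used above can also be expressed with boolean masks. The following is only a minimal sketch on made-up data: the helper name 'count_CpG_sites_near_gene', the toy location tables and the CpG site identifiers are hypothetical and not part of this notebook, and it assumes that every CpG site occurs only once in the location table.

```python
import pandas as pd

# Hypothetical toy location tables; only the column names mirror the ones used in this notebook.
genes_location_data = pd.DataFrame({
    "gene": ["ENSG00000000001", "ENSG00000000002"],
    "chromosome": ["1", "2"],
    "start_position": [1000, 5000],
    "end_position": [2000, 6000],
})
CpG_sites_location_data = pd.DataFrame({
    "CpG_site": ["cg01", "cg02", "cg03", "cg04"],
    "chromosome": ["1", "1", "1", "2"],
    "position": [900, 1500, 2600, 5500],
})

# Hypothetical list of the CpG sites that were measured (the methylation columns).
measured_CpG_sites = ["cg01", "cg02", "cg03", "cg04"]


def count_CpG_sites_near_gene(gene, measured_CpG_sites, distance):
    """Count the measured CpG sites that lie on the gene's chromosome and inside
    the window [start_position - distance, end_position + distance]."""
    gene_row = genes_location_data.loc[genes_location_data["gene"] == gene].iloc[0]

    # Same chromosome as the gene.
    on_chromosome = CpG_sites_location_data["chromosome"] == gene_row["chromosome"]
    # 'between' is inclusive on both ends by default, matching the >= and <= above.
    in_window = CpG_sites_location_data["position"].between(
        gene_row["start_position"] - distance,
        gene_row["end_position"] + distance,
    )
    # Only CpG sites that are actually present in the methylation data.
    is_measured = CpG_sites_location_data["CpG_site"].isin(measured_CpG_sites)

    return int((on_chromosome & in_window & is_measured).sum())


# A gene would be kept by the filter when this count is at least 2.
number_of_CpG_sites = count_CpG_sites_near_gene("ENSG00000000001", measured_CpG_sites, 500)
```

Because the window already spans the whole gene for any non-negative 'distance', the sketch does not repeat the separate check for CpG sites between the start and end position.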

### The <i>'get_gene_expression_data_with_2_CpG_sites_correlated_higher_than_threshold(gene_expression_data, methylation_data, threshold)'</i> function

The function 'get_gene_expression_data_with_2_CpG_sites_correlated_higher_than_threshold(gene_expression_data, methylation_data, threshold)' returns the subset of the 'gene_expression_data' DataFrame containing only the genes for which at least 2 CpG sites in the 'methylation_data' DataFrame have an absolute Pearson's correlation coefficient with the gene that is higher than or equal to the 'threshold'. Since the absolute value is used, strongly negative correlations count as well.

In [13]:
# This function retrieves a subset of the passed on 'gene_expression_data' DataFrame where the genes have at least 2 CpG 
# sites present in the 'methylation_data' DataFrame whose absolute Pearson's correlation coefficient with the current gene 
# is higher than or equal to the 'threshold'.
def get_gene_expression_data_with_2_CpG_sites_correlated_higher_than_threshold(gene_expression_data, methylation_data, threshold):
    
    # The list of genes to be removed, i.e. all the genes for which at most 1 CpG site has an absolute correlation 
    # coefficient with the gene that is higher than or equal to the 'threshold'.
    genes_to_remove = []

    # Looping over every gene present in the 'gene_expression_data' DataFrame.
    for gene in gene_expression_data.columns[1:]:
        # Retrieving the data for all 64 samples for the 'gene'.
        gene_data = gene_expression_data[gene].values.tolist()
        
        # Setting a counter which will keep track of the number of CpG sites that have a higher or equal correlation 
        # coefficient with the 'gene' than the 'threshold'.
        higher_than_threshold_counter = 0
        
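        # Checking whether the gene is of the correct format and the correct length. Genes that fail this check are 
        # skipped and therefore never added to the 'genes_to_remove' list.
        if re.match(r'^ENSG00000\d{6}$', gene) and (len(gene) == 15):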
            # Calculating the Pearson's correlation coefficient between the 'gene' and every CpG site present within the 
            # 'methylation_data' DataFrame.
            for CpG_site in methylation_data.columns[1:]:
                # Retrieving the data for all 64 samples for the current 'CpG_site'.
                CpG_site_data = methylation_data[CpG_site].values.tolist()

                # Calculating the Pearson's correlation coefficient between the 'gene_data' and the 'CpG_site_data' by 
                # calling the function 'corrcoef()' from the 'numpy' library. Since a correlation matrix is returned, we 
                # retrieve one of the non-diagonal elements (which are all the same).
                correlation_coefficient = (np.corrcoef(gene_data, CpG_site_data))[0, 1]

                # If this 'correlation_coefficient' (its absolute value retrieved by calling the function 'np.abs()') is 
                # higher than or equal to the 'threshold', we increase the 'higher_than_threshold_counter' counter by one.
                if np.abs(correlation_coefficient) >= threshold:
                    higher_than_threshold_counter = higher_than_threshold_counter + 1

            # Checking whether the 'higher_than_threshold_counter' is smaller than 2. If this is the case, the gene is added to the list 
            # 'genes_to_remove'.
            if higher_than_threshold_counter < 2:
                genes_to_remove.append(gene)
                
    # Selecting the subset of the 'gene_expression_data' DataFrame where the genes are not in the 'genes_to_remove' list. 
    selected_columns = [gene for gene in gene_expression_data.columns[1:] if gene not in genes_to_remove]
    selected_columns.insert(0, 'Samples')
    gene_expression_data_filtered = gene_expression_data[selected_columns]
    
    return gene_expression_data_filtered
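
The nested loop above calls 'np.corrcoef()' once for every combination of gene and CpG site. As a possible alternative, all coefficients can be computed with one matrix product. The following is only a sketch under assumptions: the helper name 'count_correlated_CpG_sites' and the toy DataFrames are hypothetical, the first column of both DataFrames is assumed to hold the sample names, and the Ensembl format check of the original function is left out.

```python
import numpy as np
import pandas as pd


def count_correlated_CpG_sites(gene_expression_data, methylation_data, threshold):
    """Return, for every gene, the number of CpG sites whose absolute Pearson's
    correlation coefficient with that gene is higher than or equal to 'threshold'."""
    # Dropping the first column (the sample names) and converting to float arrays
    # of shape (samples, genes) and (samples, CpG sites).
    genes = gene_expression_data.iloc[:, 1:].to_numpy(dtype=float)
    sites = methylation_data.iloc[:, 1:].to_numpy(dtype=float)

    # Standardising every column with the population standard deviation. The mean of
    # the products of two standardised columns is their Pearson's correlation coefficient.
    genes_z = (genes - genes.mean(axis=0)) / genes.std(axis=0)
    sites_z = (sites - sites.mean(axis=0)) / sites.std(axis=0)
    correlations = genes_z.T @ sites_z / genes.shape[0]

    # Counting, per gene (row), the CpG sites that reach the threshold.
    counts = (np.abs(correlations) >= threshold).sum(axis=1)
    return pd.Series(counts, index=gene_expression_data.columns[1:])


# Hypothetical toy data with 5 samples, 2 genes and 3 CpG sites.
toy_gene_expression = pd.DataFrame({
    "Samples": ["s1", "s2", "s3", "s4", "s5"],
    "ENSG00000000001": [1, 2, 3, 4, 5],
    "ENSG00000000002": [2, 1, 4, 3, 5],
})
toy_methylation = pd.DataFrame({
    "Samples": ["s1", "s2", "s3", "s4", "s5"],
    "cg01": [2, 4, 6, 8, 10],
    "cg02": [5, 4, 3, 2, 1],
    "cg03": [1, 3, 2, 5, 4],
})

counts = count_correlated_CpG_sites(toy_gene_expression, toy_methylation, 0.9)
# Genes that would be kept by the filter: those with at least 2 such CpG sites.
genes_to_keep = counts[counts >= 2].index.tolist()
```

A column without any variation has a standard deviation of zero and yields NaN here, which never reaches the threshold; 'np.corrcoef()' returns NaN in that situation as well.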