# Checking Resulting Methylation Data and Gene Expression Data File
### Laurence Nickel (i6257119)

Libraries used:
* pandas (version: '1.2.4')
* sys (standard library, Python version: '3.8.8')
* os (standard library, Python version: '3.8.8')

## Introduction

In this notebook, we check whether the methylation data files and the gene expression data files share the same structure, so that they can be used directly by the machine learning techniques applied later on. This involves updating the datasets so that they all contain the same samples and checking whether the samples appear in the same order across the four datasets (two for methylation data and two for gene expression data).

### Importing libraries

Before checking the methylation data files and the gene expression data files, we first import the libraries that are used throughout this notebook.

In [1]:
print("Starting the importing of the libraries...")


import pandas as pd
import sys
import os


print("Finished importing the libraries.")

Starting the importing of the libraries...
Finished importing the libraries.


Now that the libraries have been imported, we can verify that they were loaded correctly by printing the version of pandas and the version string exposed by 'sys', which is the version of the Python interpreter itself.

In [2]:
# Retrieving the version of the libraries to verify they have been correctly loaded into this notebook.
print("The library 'pd' (pandas) has been loaded into the notebook with its version being:")
print(pd.__version__)

print("\nThe library 'sys' has been loaded into the notebook with its version being:")
print(sys.version)

The library 'pd' (pandas) has been loaded into the notebook with its version being:
1.2.4

The library 'sys' has been loaded into the notebook with its version being:
3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]


### Defining the data directory

We also need to define the data directory from which the files are loaded and to which the resulting files are written. Note that this path must be changed to the corresponding directory on your own machine before the notebook can be run.

In [3]:
data_directory_combined_cleaned_files = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/combined_cleaned_data"
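
As a minimal sketch of how the file paths used in the loading cells could be assembled from such a directory, the example below uses 'pathlib' from the standard library instead of string concatenation. The directory 'path/to/combined_cleaned_data' and the names 'data_directory', 'file_paths' and 'missing_files' are placeholders introduced for illustration; they are not part of the original notebook.

```python
from pathlib import Path

# Placeholder directory (an assumption for illustration); replace with your own.
data_directory = Path("path/to/combined_cleaned_data")

# File names taken from the loading cells of this notebook.
file_names = [
    "methylation_data_cleaned.csv",
    "methylation_data_cleaned_M_transformed.csv",
    "gene_expression_data_cleaned.csv",
    "gene_expression_data_cleaned_log2_transformed.csv",
]

# The '/' operator lets pathlib insert the right separator for the platform.
file_paths = {name: data_directory / name for name in file_names}

# Listing missing files up front avoids a failure halfway through loading.
missing_files = [name for name, path in file_paths.items() if not path.is_file()]
print("Missing files:", missing_files)
```

Checking for missing files before loading is optional, but it gives a clearer message than the error raised by 'pd.read_csv()' when the directory has not been adjusted yet.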

## Loading Methylation Data Files

In this section, we load the methylation data files into the notebook by calling the function 'pd.read_csv()'.

#### Loading the 'methylation_data_cleaned.csv' file into this notebook

In [4]:
# Loading the 'methylation_data_cleaned.csv' file into this notebook by calling the function 'pd.read_csv()'.
methylation_data = pd.read_csv(data_directory_combined_cleaned_files + "/" + "methylation_data_cleaned.csv")

print("The 'methylation_data' DataFrame containing the data from the 'methylation_data_cleaned.csv' file:")
methylation_data

The 'methylation_data' DataFrame containing the data from the 'methylation_data_cleaned.csv' file:


Unnamed: 0,CpG sites,TCGA-06-0211-01A-01,TCGA-26-5136-01B-01,TCGA-06-1804-01A-01,TCGA-06-5408-01A-01,TCGA-19-4065-01A-01,TCGA-14-0736-02A-01,TCGA-06-5415-01A-01,TCGA-06-0221-02A-11,TCGA-19-5960-01A-11,...,TCGA-06-0190-01A-01,TCGA-06-0171-02A-11,TCGA-06-5418-01A-01,TCGA-14-1402-02A-01,TCGA-26-5134-01A-01,TCGA-28-5208-01A-01,TCGA-12-5295-01A-01,TCGA-76-4931-01A-01,TCGA-06-0190-02A-01,TCGA-76-4929-01A-01
0,cg00050873,0.598401,0.696228,0.693775,0.672346,0.474384,0.695819,0.844931,0.616233,0.472024,...,0.758095,0.872146,0.681442,0.797791,0.785146,0.825541,0.646908,0.705596,0.611971,0.632039
1,cg00212031,0.148621,0.116158,0.114274,0.045200,0.041728,0.034927,0.033947,0.035452,0.019626,...,0.026595,0.036259,0.218612,0.172446,0.031262,0.331090,0.241757,0.121716,0.031320,0.478576
2,cg00213748,0.836599,0.637830,0.665190,0.589482,0.081013,0.607042,0.748129,0.632638,0.339565,...,0.564641,0.846239,0.580089,0.726151,0.893503,0.778891,0.614615,0.529745,0.516280,0.622359
3,cg00455876,0.642416,0.693596,0.650437,0.639665,0.501514,0.758325,0.783247,0.664261,0.900718,...,0.671254,0.814129,0.700934,0.454763,0.830252,0.935513,0.489857,0.661220,0.664484,0.691125
4,cg01707559,0.671818,0.312592,0.554964,0.315414,0.448499,0.028153,0.051755,0.030808,0.023221,...,0.026048,0.504077,0.377020,0.551934,0.039638,0.040118,0.516529,0.314803,0.033202,0.495401
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
280995,ch.22.707049R,0.023967,0.093547,0.091928,0.042238,0.030906,0.042918,0.077318,0.057004,0.090479,...,0.023218,0.030588,0.132274,0.075126,0.048225,0.220282,0.078535,0.117766,0.029380,0.179526
280996,ch.22.728807R,0.074757,0.103989,0.064681,0.130256,0.072425,0.093775,0.078673,0.108003,0.152517,...,0.069550,0.084550,0.101242,0.090866,0.087217,0.123833,0.127836,0.064936,0.076466,0.146808
280997,ch.22.734399R,0.205177,0.220414,0.291399,0.315832,0.144253,0.206865,0.297240,0.270165,0.303982,...,0.169489,0.123179,0.252650,0.265653,0.182423,0.248715,0.267856,0.357379,0.164743,0.280876
280998,ch.22.772318F,0.032142,0.143433,0.196813,0.045382,0.030782,0.052378,0.086724,0.061200,0.117800,...,0.033715,0.024192,0.083278,0.076331,0.058467,0.118654,0.067077,0.050092,0.032427,0.157513


#### Loading the 'methylation_data_cleaned_M_transformed.csv' file into this notebook

In [5]:
# Loading the 'methylation_data_cleaned_M_transformed.csv' file into this notebook by calling the function 'pd.read_csv()'.
methylation_data_M_transformed = pd.read_csv(data_directory_combined_cleaned_files + "/" + "methylation_data_cleaned_M_transformed.csv")

print("The 'methylation_data_M_transformed' DataFrame containing the data from the 'methylation_data_cleaned_M_transformed.csv' file:")
methylation_data_M_transformed

The 'methylation_data_M_transformed' DataFrame containing the data from the 'methylation_data_cleaned_M_transformed.csv' file:


Unnamed: 0,CpG sites,TCGA-06-0211-01A-01,TCGA-26-5136-01B-01,TCGA-06-1804-01A-01,TCGA-06-5408-01A-01,TCGA-19-4065-01A-01,TCGA-14-0736-02A-01,TCGA-06-5415-01A-01,TCGA-06-0221-02A-11,TCGA-19-5960-01A-11,...,TCGA-06-0190-01A-01,TCGA-06-0171-02A-11,TCGA-06-5418-01A-01,TCGA-14-1402-02A-01,TCGA-26-5134-01A-01,TCGA-28-5208-01A-01,TCGA-12-5295-01A-01,TCGA-76-4931-01A-01,TCGA-06-0190-02A-01,TCGA-76-4929-01A-01
0,cg00050873,0.575358,1.196570,1.179877,1.037031,-0.147951,1.193780,2.445926,0.683243,-0.161612,...,1.647939,2.770073,1.097036,1.980160,1.869605,2.242452,0.873516,1.261048,0.657298,0.780458
1,cg00212031,-2.518160,-2.927707,-2.954364,-4.400794,-4.521361,-4.788215,-4.830728,-4.765906,-5.642500,...,-5.193802,-4.732231,-1.837664,-2.262707,-4.953635,-1.014591,-1.649103,-2.851172,-4.950884,-0.123710
2,cg00213748,2.356115,0.816507,0.990424,0.522004,-3.503816,0.627419,1.570600,0.784178,-0.959728,...,0.375128,2.460377,0.466191,1.406889,3.068663,1.816665,0.673383,0.171852,0.093984,0.720734
3,cg00455876,0.845228,1.178662,0.895856,0.827976,0.008734,1.649747,1.853414,0.984410,3.181472,...,1.029885,2.130957,1.228815,-0.261768,2.290150,3.858676,-0.058539,0.964784,0.985853,1.161921
4,cg01707559,1.033575,-1.136884,0.318470,-1.117984,-0.298258,-5.109380,-4.195476,-4.975419,-5.394501,...,-5.224614,0.023525,-0.724543,0.300784,-4.598641,-4.580530,0.095419,-1.122069,-4.863882,-0.026542
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
280995,ch.22.707049R,-5.347784,-3.276470,-3.304231,-4.503067,-4.970658,-4.478993,-3.576952,-4.048114,-3.329448,...,-5.394690,-4.986055,-2.713710,-3.621882,-4.302766,-1.823602,-3.552513,-2.905236,-5.045978,-2.192265
280996,ch.22.728807R,-3.629545,-3.107092,-3.854037,-2.739243,-3.678896,-3.272591,-3.549764,-3.045972,-2.474210,...,-3.741805,-3.436608,-3.150129,-3.322677,-3.387592,-2.822816,-2.770308,-3.847968,-3.594284,-2.538937
280997,ch.22.734399R,-1.953764,-1.822492,-1.281976,-1.115193,-2.568587,-1.938879,-1.241406,-1.433729,-1.195136,...,-2.292808,-2.831527,-1.564646,-1.466921,-2.164064,-1.594869,-1.450667,-0.846516,-2.342003,-1.356304
280998,ch.22.772318F,-4.912269,-2.578194,-2.028914,-4.394741,-4.976683,-4.177285,-3.396540,-3.939215,-2.904770,...,-4.840999,-5.334004,-3.460468,-3.597027,-4.009312,-2.892950,-3.797869,-4.245142,-4.899087,-2.419188


## Loading Gene Expression Data Files

In this section, we load the gene expression data files into the notebook by calling the function 'pd.read_csv()'.

#### Loading the 'gene_expression_data_cleaned.csv' file into this notebook

In [6]:
# Loading the 'gene_expression_data_cleaned.csv' file into this notebook by calling the function 'pd.read_csv()'.
gene_expression_data = pd.read_csv(data_directory_combined_cleaned_files + "/" + "gene_expression_data_cleaned.csv")

print("The 'gene_expression_data' DataFrame containing the data from the 'gene_expression_data_cleaned.csv' file:")
gene_expression_data

The 'gene_expression_data' DataFrame containing the data from the 'gene_expression_data_cleaned.csv' file:


Unnamed: 0,Gene ID,Gene Name,TCGA-06-0221-02A-11,TCGA-76-4928-01B-01,TCGA-26-5132-01A-01,TCGA-28-5218-01A-01,TCGA-06-0210-01A-01,TCGA-32-5222-01A-01,TCGA-06-5859-01A-01,TCGA-06-5413-01A-01,...,TCGA-32-1980-01A-01,TCGA-19-1389-02A-21,TCGA-19-5960-01A-11,TCGA-06-5418-01A-01,TCGA-06-1804-01A-01,TCGA-15-1444-01A-02,TCGA-06-0152-02A-01,TCGA-06-5415-01A-01,TCGA-19-0957-02A-11,TCGA-76-4929-01A-01
0,ENSG00000000003.15,TSPAN6,34.3045,66.0662,71.9519,76.9722,81.7969,105.1259,106.8657,146.2413,...,24.3036,56.1192,110.8961,84.0034,51.8153,79.3015,50.9843,95.7581,34.2730,117.6090
1,ENSG00000000005.6,TNMD,0.1278,0.2063,1.1969,212.6327,0.1334,0.3746,6.5304,2.0149,...,0.4089,0.0000,1.1017,0.8864,0.3312,0.6253,0.1591,0.8233,0.2051,5.6656
2,ENSG00000000419.13,DPM1,65.3189,89.2847,84.2576,85.4885,80.0162,143.1418,96.0231,132.6740,...,74.3346,75.4281,108.3149,105.1454,65.9500,61.3413,98.1366,117.6062,46.9677,90.8538
3,ENSG00000000457.14,SCYL3,7.6179,4.2106,6.3164,8.0810,4.8043,7.0083,10.8756,7.6998,...,5.0076,3.8505,9.8632,4.8304,5.1332,5.4407,6.3809,7.4290,4.9373,7.6404
4,ENSG00000000460.17,C1orf112,6.8873,3.8754,5.5325,4.7098,2.8904,7.2234,6.0676,7.6639,...,3.2461,2.5348,8.0895,3.8546,3.2281,5.1131,5.7963,5.2377,2.9793,2.3827
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20432,ENSG00000288617.1,AL049777.1,6.8471,2.7489,7.7796,0.2672,8.4065,5.0177,8.4420,7.7698,...,1.1456,0.7985,4.8613,7.1844,12.4826,3.1376,10.8429,9.5819,2.2628,0.2630
20433,ENSG00000288658.1,AC010980.1,2.7653,0.4309,0.6472,1.3285,0.1769,0.3028,2.2793,1.1823,...,3.0511,1.1028,2.6306,0.1297,1.0120,0.4889,2.2429,1.3863,1.0459,3.6902
20434,ENSG00000288667.1,AC078856.1,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,...,0.0000,1.3410,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,1.2408,0.0000
20435,ENSG00000288670.1,AL592295.6,19.4144,6.9073,7.2689,13.3304,7.5160,9.1788,17.0350,6.3300,...,11.1699,9.1893,22.6318,9.5830,10.3947,16.3091,12.0011,13.0884,15.5815,9.9713


#### Loading the 'gene_expression_data_cleaned_log2_transformed.csv' file into this notebook

In [7]:
# Loading the 'gene_expression_data_cleaned_log2_transformed.csv' file into this notebook by calling the function 'pd.read_csv()'.
gene_expression_data_log2_transformed = pd.read_csv(data_directory_combined_cleaned_files + "/" + "gene_expression_data_cleaned_log2_transformed.csv")

print("The 'gene_expression_data_log2_transformed' DataFrame containing the data from the 'gene_expression_data_cleaned_log2_transformed.csv' file:")
gene_expression_data_log2_transformed

The 'gene_expression_data_log2_transformed' DataFrame containing the data from the 'gene_expression_data_cleaned_log2_transformed.csv' file:


Unnamed: 0,Gene ID,Gene Name,TCGA-06-0221-02A-11,TCGA-76-4928-01B-01,TCGA-26-5132-01A-01,TCGA-28-5218-01A-01,TCGA-06-0210-01A-01,TCGA-32-5222-01A-01,TCGA-06-5859-01A-01,TCGA-06-5413-01A-01,...,TCGA-32-1980-01A-01,TCGA-19-1389-02A-21,TCGA-19-5960-01A-11,TCGA-06-5418-01A-01,TCGA-06-1804-01A-01,TCGA-15-1444-01A-02,TCGA-06-0152-02A-01,TCGA-06-5415-01A-01,TCGA-19-0957-02A-11,TCGA-76-4929-01A-01
0,ENSG00000000003.15,TSPAN6,5.141780,6.067514,6.188874,6.284888,6.371505,6.729633,6.753092,7.202039,...,4.661271,5.835904,6.806016,6.409449,5.722884,6.327355,5.700004,6.596311,5.140492,6.890070
1,ENSG00000000005.6,TNMD,0.173511,0.270589,1.135469,7.738989,0.180657,0.459012,2.912727,1.592110,...,0.494569,0.000000,1.071557,0.915636,0.412727,0.700706,0.213005,0.866552,0.269153,2.736735
2,ENSG00000000419.13,DPM1,6.051348,6.496410,6.413757,6.434436,6.340139,7.171345,6.600256,7.062575,...,6.235241,6.256031,6.772346,6.729898,6.065012,5.962116,6.631346,6.890036,5.583991,6.521268
3,ENSG00000000457.14,SCYL3,3.107336,2.381450,2.871134,3.182851,2.537122,3.001496,3.569929,3.120982,...,2.586789,2.278133,3.441377,2.543595,2.616640,2.687217,2.883797,3.075361,2.569807,3.111098
4,ENSG00000000460.17,C1orf112,2.979532,2.285521,2.707635,2.513440,1.959918,3.039735,2.821220,3.115017,...,2.086138,1.821629,3.184201,2.279352,2.080009,2.611904,2.764750,2.641014,1.992515,1.758175
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20432,ENSG00000288617.1,AL049777.1,2.972160,1.906467,3.134155,0.341644,3.233658,2.589212,3.239092,3.132544,...,1.101381,0.846794,2.551221,3.032877,3.753027,2.048794,3.565950,3.403527,1.706111,0.336855
20433,ENSG00000288658.1,AC010980.1,1.912765,0.516923,0.720016,1.219401,0.234992,0.381616,1.713388,1.125849,...,2.018314,1.072312,1.860208,0.175940,1.008630,0.574247,1.697285,1.254775,1.032736,2.229649
20434,ENSG00000288667.1,AC078856.1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,1.227125,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.164014,0.000000
20435,ENSG00000288670.1,AL592295.6,4.351515,2.983185,3.047695,3.841007,3.090176,3.347496,4.172728,2.873813,...,3.605245,3.348983,4.562658,3.403677,3.510291,4.113459,3.700562,3.816436,4.051503,3.455663


## Checking Whether the Samples are the Same in the Datasets

First, we check whether all four datasets contain the same samples. This is probably not the case, since a sample was removed from the methylation data but not from the gene expression data. It is therefore likely that this sample has to be removed from the two gene expression datasets as well.

To find out, we retrieve the column names of each dataset and compare the number of sample columns. Note that an equal number of columns does not by itself guarantee that the datasets contain the same samples.
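
The caveat about equal column counts can be illustrated with a small sketch. The sample names 'S1' to 'S4' are made up for illustration and do not occur in the datasets.

```python
# Toy sample names, made up for illustration only.
samples_a = ["S1", "S2", "S3"]
samples_b = ["S1", "S2", "S4"]

# Equal lengths only say that the number of samples matches.
same_count = len(samples_a) == len(samples_b)

# Comparing the sets says whether the samples themselves match.
same_samples = set(samples_a) == set(samples_b)

# The symmetric difference lists the samples found in only one of the lists.
only_in_one = sorted(set(samples_a) ^ set(samples_b))

print("Same number of samples:", same_count)
print("Same samples:", same_samples)
print("Samples in only one of the lists:", only_in_one)
```

Here both lists have three samples, yet 'S3' and 'S4' each appear in only one of them, which is why the count check is followed by a comparison of the sample names.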

In [8]:
# Retrieving the columns of the four DataFrames. Since in the 'methylation_data' and 'methylation_data_M_transformed' 
# DataFrames the first column contains the CpG sites, this one is excluded. Since in the 'gene_expression_data' and 
# 'gene_expression_data_log2_transformed' DataFrame the first column contains the id of the gene and the second column 
# contains the name of the gene, these are excluded.
methylation_data_columns = methylation_data.columns[1:].tolist()
methylation_data_M_transformed_columns = methylation_data_M_transformed.columns[1:].tolist()
gene_expression_data_columns = gene_expression_data.columns[2:].tolist()
gene_expression_data_log2_transformed_columns = gene_expression_data_log2_transformed.columns[2:].tolist()

# Checking whether the number of columns is equal for the 4 DataFrames. 
if len(methylation_data_columns) == len(methylation_data_M_transformed_columns) == len(gene_expression_data_columns) == len(gene_expression_data_log2_transformed_columns):
    print("The number of samples present in the four DataFrames is equal.")
else:
    print("The number of samples present in the four DataFrames is not equal.")
    print("The number of samples present within the 'methylation_data' DataFrame: " + str(len(methylation_data_columns)))
    print("The number of samples present within the 'methylation_data_M_transformed' DataFrame: " + str(len(methylation_data_M_transformed_columns)))
    print("The number of samples present within the 'gene_expression_data' DataFrame: " + str(len(gene_expression_data_columns)))
    print("The number of samples present within the 'gene_expression_data_log2_transformed' DataFrame: " + str(len(gene_expression_data_log2_transformed_columns)))

The number of samples present in the four DataFrames is not equal.
The number of samples present within the 'methylation_data' DataFrame: 64
The number of samples present within the 'methylation_data_M_transformed' DataFrame: 64
The number of samples present within the 'gene_expression_data' DataFrame: 65
The number of samples present within the 'gene_expression_data_log2_transformed' DataFrame: 65


As expected, the number of samples is not the same across the four datasets, because a sample was removed from the methylation data but not from the gene expression data.

The next step is to identify which samples are not present in all the datasets. This can be done by looping over the sample lists of the datasets and using the set method 'intersection_update()' together with the list method 'extend()'.

In [9]:
# Creating a list of all the dataset samples.
all_samples = [methylation_data_columns, methylation_data_M_transformed_columns, gene_expression_data_columns, gene_expression_data_log2_transformed_columns]

# Creating a set where the samples that are common will be stored and initializing it with the first set of samples present 
# in the 'all_samples' list.
common_samples = set(all_samples[0])

# Looping over all the sets of samples present within the 'all_samples' list and updating the 'common_samples' set by 
# calling the function 'intersection_update()' for each of the sets of samples.
for set_of_samples in all_samples[1:]:
    common_samples.intersection_update(set_of_samples)
    
# Looping over all the sets of samples and adding the samples to the 'unique_samples' list which do not occur in all the 
# sets of samples. To achieve this, the function 'extend()' can be called.
unique_samples = []
for set_of_samples in all_samples:
    unique_samples.extend(set(set_of_samples) - common_samples)
    
# Retrieving the list of samples which are not present in all the sets of samples present in the 'all_samples' list.
samples_not_in_all_datasets = list(set(unique_samples))

print("The samples which are not present in all the sets of samples present in the 'all_samples' list:")
samples_not_in_all_datasets

The samples which are not present in all the sets of samples present in the 'all_samples' list:


['TCGA-06-5416-01A-01']

As the output above shows, the only sample that is not present in all the datasets is 'TCGA-06-5416-01A-01'. This is the sample that was already removed from the two methylation datasets. To make sure that all four datasets contain the same samples, we therefore remove this sample from the two gene expression datasets as well. This can be done by calling the function 'drop()' with 'axis=1', which removes the column corresponding to this sample.
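
A minimal sketch of such a column removal on a toy DataFrame is shown here; the column names are made up for illustration. It passes 'errors="ignore"', an optional argument of 'drop()', so that running the same call again after the column is already gone does not raise a 'KeyError'. The notebook's own cell does not use this argument.

```python
import pandas as pd

# Toy DataFrame with made-up column names, for illustration only.
toy = pd.DataFrame({
    "Gene ID": ["g1", "g2"],
    "sample_A": [1.0, 2.0],
    "sample_B": [3.0, 4.0],
})

# Dropping by column name; errors="ignore" makes the call safe to repeat.
toy = toy.drop(columns=["sample_B"], errors="ignore")
toy = toy.drop(columns=["sample_B"], errors="ignore")  # second call changes nothing

remaining_columns = toy.columns.tolist()
print(remaining_columns)
```

Whether to ignore a missing column is a design choice: ignoring it makes the cell safe to rerun, while the default behaviour surfaces a misspelled sample name immediately.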

In [10]:
# Removing the column corresponding to the sample with the sample name 'TCGA-06-5416-01A-01' from the 'gene_expression_data' 
# and the 'gene_expression_data_log2_transformed' DataFrames.
gene_expression_data = gene_expression_data.drop('TCGA-06-5416-01A-01', axis=1)
gene_expression_data_log2_transformed = gene_expression_data_log2_transformed.drop('TCGA-06-5416-01A-01', axis=1)

# Checking whether the column corresponding to the sample with the sample name 'TCGA-06-5416-01A-01' has indeed been removed 
# from the two gene expression DataFrames by retrieving the columns of the DataFrames.
gene_expression_data_columns_sample_removed = gene_expression_data.columns[2:].tolist()
gene_expression_data_log2_transformed_columns_sample_removed = gene_expression_data_log2_transformed.columns[2:].tolist()

print("The number of samples present within the 'gene_expression_data' DataFrame: " + str(len(gene_expression_data_columns_sample_removed)))
print("The number of samples present within the 'gene_expression_data_log2_transformed' DataFrame: " + str(len(gene_expression_data_log2_transformed_columns_sample_removed)))

The number of samples present within the 'gene_expression_data' DataFrame: 64
The number of samples present within the 'gene_expression_data_log2_transformed' DataFrame: 64


As the output above shows, the sample 'TCGA-06-5416-01A-01' has indeed been removed from the two gene expression datasets. To verify that every remaining sample is now present in all the datasets, we rerun the earlier check on the updated column lists.

In [11]:
# Creating a list of all the dataset samples.
all_samples = [methylation_data_columns, methylation_data_M_transformed_columns, gene_expression_data_columns_sample_removed, gene_expression_data_log2_transformed_columns_sample_removed]

# Creating a set where the samples that are common will be stored and initializing it with the first set of samples present 
# in the 'all_samples' list.
common_samples = set(all_samples[0])

# Looping over all the sets of samples present within the 'all_samples' list and updating the 'common_samples' set by 
# calling the function 'intersection_update()' for each of the sets of samples.
for set_of_samples in all_samples[1:]:
    common_samples.intersection_update(set_of_samples)
    
# Looping over all the sets of samples and adding the samples to the 'unique_samples' list which do not occur in all the 
# sets of samples. To achieve this, the function 'extend()' can be called.
unique_samples = []
for set_of_samples in all_samples:
    unique_samples.extend(set(set_of_samples) - common_samples)
    
# Retrieving the list of samples which are not present in all the sets of samples present in the 'all_samples' list.
samples_not_in_all_datasets = list(set(unique_samples))

print("The samples which are not present in all the sets of samples present in the 'all_samples' list:")
samples_not_in_all_datasets

The samples which are not present in all the sets of samples present in the 'all_samples' list:


[]

As the output above shows, every remaining sample is now present in all four datasets.
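
Since the same check is run twice in this section, it could also be wrapped in a small helper. The sketch below is one possible way to do so and is not part of the original notebook: the function name 'samples_not_in_all' and the toy sample names are assumptions, and it uses 'set.intersection()' and 'set.union()' rather than the loop with 'intersection_update()' and 'extend()' used in the cells of this section.

```python
def samples_not_in_all(*sample_lists):
    """Return the sorted samples that are missing from at least one list.

    Hypothetical helper for illustration; expects at least one list.
    """
    sample_sets = [set(samples) for samples in sample_lists]
    common = set.intersection(*sample_sets)   # samples present in every list
    everything = set.union(*sample_sets)      # samples present in any list
    return sorted(everything - common)


# Toy sample names, made up for illustration only.
toy_methylation = ["S1", "S2"]
toy_expression = ["S1", "S2", "S3"]

before_removal = samples_not_in_all(toy_methylation, toy_expression)
after_removal = samples_not_in_all(toy_methylation, toy_expression[:2])

print(before_removal)
print(after_removal)
```

Returning a sorted list keeps the result reproducible between runs, which a list built from a plain set does not guarantee.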

## Handling the Order of the Samples in the Datasets

Next, we check whether the case_id columns (representing the samples) appear in the same order in all the methylation data and gene expression data DataFrames. This is probably not the case: the columns were added in the alphabetical order of the file names, and the file names are not the same for the same case_id (which is the reason why the 'reference_table_files' exists). The case_ids will therefore most likely not have been added in the same order. Note that having the case_ids in the same order is not a necessity, but it does make it easier later on to relate the columns to each other when applying the machine learning techniques.

In [12]:
# Checking whether the case_id columns appear in the same order in the four DataFrames: 'methylation_data', 
# 'methylation_data_M_transformed', 'gene_expression_data', and 'gene_expression_data_log2_transformed'.
mismatch_found = False
for i in range(len(methylation_data_columns)):
    if (methylation_data_columns[i] != methylation_data_M_transformed_columns[i]
        or methylation_data_columns[i] != gene_expression_data_columns_sample_removed[i]
        or methylation_data_columns[i] != gene_expression_data_log2_transformed_columns_sample_removed[i]):
        mismatch_found = True
        break

if mismatch_found:
    print("The cases do not appear in the same order in all DataFrames.")
else:
    print("The cases appear in the same order in all DataFrames.")

The cases do not appear in the same order in all DataFrames.


As expected, the cases do not appear in the same order in the four DataFrames. To fix this, we sort the columns alphabetically by calling the function 'sort_index()' with the parameter 'axis=1', which indicates that the columns are sorted rather than the rows. Because the names of the identifier columns ('CpG sites', 'Gene ID' and 'Gene Name') sort before the sample names starting with 'TCGA', they remain the first columns after sorting.

In [13]:
# Sorting the columns of the four DataFrames based on alphabetical order.
methylation_data.sort_index(axis=1, inplace=True)
methylation_data_M_transformed.sort_index(axis=1, inplace=True)
gene_expression_data.sort_index(axis=1, inplace=True)
gene_expression_data_log2_transformed.sort_index(axis=1, inplace=True)
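
The column lists retrieved earlier in the notebook were created before sorting, so confirming the new order requires retrieving the column names again. The sketch below shows one way to do this on toy DataFrames; the DataFrame names and the sample names 'TCGA-A' and 'TCGA-B' are made up for illustration.

```python
import pandas as pd

# Toy DataFrames with made-up sample names in different column orders.
toy_methylation = pd.DataFrame(
    {"CpG sites": ["cg1"], "TCGA-B": [0.2], "TCGA-A": [0.1]}
)
toy_expression = pd.DataFrame(
    {"Gene ID": ["g1"], "Gene Name": ["n1"], "TCGA-A": [5.0], "TCGA-B": [6.0]}
)

# Sorting the columns alphabetically, as done for the four real DataFrames.
toy_methylation = toy_methylation.sort_index(axis=1)
toy_expression = toy_expression.sort_index(axis=1)

# Skipping the identifier columns: one for methylation, two for gene expression.
methylation_samples = toy_methylation.columns[1:].tolist()
expression_samples = toy_expression.columns[2:].tolist()

# Comparing two lists with == checks both the content and the order.
same_order = methylation_samples == expression_samples
print("Samples appear in the same order:", same_order)
```

Comparing whole lists replaces the index-based loop used before sorting and also reports a mismatch when the lists differ in length.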

In [14]:
# Displaying the sorted 'methylation_data' DataFrame.
print("The 'methylation_data' DataFrame after sorting its columns based on alphabetical order:")
methylation_data

The 'methylation_data' DataFrame after sorting its columns based on alphabetical order:


Unnamed: 0,CpG sites,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,TCGA-06-0211-01A-01,...,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
0,cg00050873,0.668018,0.736039,0.760401,0.872146,0.758095,0.611971,0.571506,0.580738,0.598401,...,0.969237,0.471588,0.677599,0.783116,0.544668,0.752927,0.807404,0.632039,0.705596,0.668245
1,cg00212031,0.276804,0.219997,0.035574,0.036259,0.026595,0.031320,0.045200,0.102248,0.148621,...,0.025742,0.881518,0.189080,0.035286,0.030420,0.038005,0.117961,0.478576,0.121716,0.137638
2,cg00213748,0.596279,0.596430,0.895686,0.846239,0.564641,0.516280,0.589482,0.535672,0.836599,...,0.634809,0.191287,0.599771,0.498817,0.612368,0.443445,0.644493,0.622359,0.529745,0.601070
3,cg00455876,0.379327,0.646238,0.549468,0.814129,0.671254,0.664484,0.473276,0.634890,0.642416,...,0.953431,0.706968,0.658650,0.802912,0.641963,0.824415,0.758325,0.691125,0.661220,0.635963
4,cg01707559,0.459561,0.440786,0.781956,0.504077,0.026048,0.033202,0.597480,0.320871,0.671818,...,0.525477,0.859829,0.533131,0.943642,0.071227,0.755489,0.410354,0.495401,0.314803,0.449124
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
280995,ch.22.707049R,0.030177,0.031269,0.038748,0.030588,0.023218,0.029380,0.021020,0.032773,0.023967,...,0.052957,0.069201,0.033329,0.122236,0.057525,0.098730,0.078825,0.179526,0.117766,0.083062
280996,ch.22.728807R,0.042784,0.086301,0.058598,0.084550,0.069550,0.076466,0.061619,0.094007,0.074757,...,0.142701,0.067897,0.082971,0.086607,0.101038,0.101680,0.063560,0.146808,0.064936,0.100809
280997,ch.22.734399R,0.259508,0.227487,0.201834,0.123179,0.169489,0.164743,0.290668,0.274596,0.205177,...,0.232739,0.336258,0.268942,0.243600,0.281208,0.369930,0.248358,0.280876,0.357379,0.276952
280998,ch.22.772318F,0.031644,0.031728,0.034949,0.024192,0.033715,0.032427,0.031951,0.038325,0.032142,...,0.090136,0.058802,0.046672,0.103855,0.077849,0.137996,0.051274,0.157513,0.050092,0.071710


In [15]:
# Displaying the sorted 'methylation_data_M_transformed' DataFrame.
print("The 'methylation_data_M_transformed' DataFrame after sorting its columns based on alphabetical order:")
methylation_data_M_transformed

The 'methylation_data_M_transformed' DataFrame after sorting its columns based on alphabetical order:


Unnamed: 0,CpG sites,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,TCGA-06-0211-01A-01,...,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
0,cg00050873,1.008781,1.479459,1.666142,2.770073,1.647939,0.657298,0.415492,0.470034,0.575358,...,4.977562,-0.164138,1.071576,1.852304,0.258456,1.607573,2.067716,0.780458,1.261048,1.010256
1,cg00212031,-1.385519,-1.825992,-4.760758,-4.732231,-5.193802,-4.950884,-4.400794,-3.134249,-2.518160,...,-5.242089,2.895317,-2.100560,-4.772923,-4.994269,-4.661749,-2.902526,-0.123710,-2.851172,-2.647414
2,cg00213748,0.562630,0.563531,3.102062,2.460377,0.375128,0.093984,0.522004,0.206206,2.356115,...,0.797669,-2.079887,0.583586,-0.006827,0.659712,-0.327768,0.858285,0.720734,0.171852,0.591398
3,cg00455876,-0.710392,0.869283,0.286405,2.130957,1.029885,0.985853,-0.154363,0.798173,0.845228,...,4.355690,1.270588,0.948258,2.026400,0.842384,2.231204,1.649747,1.161921,0.964784,0.804858
4,cg01707559,-0.233873,-0.343320,1.842466,0.023525,-5.224614,-4.863882,0.569831,-1.081695,1.033575,...,0.147147,2.616856,0.191473,4.065534,-3.704820,1.627512,-0.522980,-0.026542,-1.122069,-0.294616
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
280995,ch.22.707049R,-5.006189,-4.953307,-4.632719,-4.986055,-5.394690,-5.045978,-5.541445,-4.883268,-5.347784,...,-4.160536,-3.749599,-4.858180,-2.844166,-4.034204,-3.190405,-3.546754,-2.192265,-2.905236,-3.464562
280996,ch.22.728807R,-4.483695,-3.404266,-4.005875,-3.436608,-3.741805,-3.594284,-3.928725,-3.268656,-3.629545,...,-2.586804,-3.779062,-3.466283,-3.398678,-3.153368,-3.143200,-3.881002,-2.538937,-3.847968,-3.156996
280997,ch.22.734399R,-1.512707,-1.763778,-1.983516,-2.831527,-2.292808,-2.342003,-1.287092,-1.401476,-1.953764,...,-1.721006,-0.981054,-1.442693,-1.634639,-1.353937,-0.768262,-1.597628,-1.356304,-0.846516,-1.384458
280998,ch.22.772318F,-4.935511,-4.931599,-4.787276,-5.334004,-4.840999,-4.899087,-4.921159,-4.649189,-4.912269,...,-3.335483,-4.000571,-4.352346,-3.109168,-3.566246,-2.643066,-4.209691,-2.419188,-4.245142,-3.694331


In [16]:
# Displaying the sorted 'gene_expression_data' DataFrame.
print("The 'gene_expression_data' DataFrame after sorting its columns based on alphabetical order:")
gene_expression_data

The 'gene_expression_data' DataFrame after sorting its columns based on alphabetical order:


Unnamed: 0,Gene ID,Gene Name,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,...,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
0,ENSG00000000003.15,TSPAN6,76.7833,82.9215,50.9843,30.1774,70.0066,40.8241,81.7969,47.4205,...,24.3036,105.1259,62.3738,78.7018,113.3587,74.8750,66.0662,117.6090,71.2790,92.9495
1,ENSG00000000005.6,TNMD,0.4035,0.4189,0.1591,0.1364,0.2893,1.1342,0.1334,0.5607,...,0.4089,0.3746,0.3449,0.4110,2.5210,0.2551,0.2063,5.6656,0.2270,0.3167
2,ENSG00000000419.13,DPM1,97.4399,63.4521,98.1366,55.5088,80.6678,85.4610,80.0162,49.7091,...,74.3346,143.1418,118.7632,169.7759,68.5659,64.4713,89.2847,90.8538,84.3173,66.3511
3,ENSG00000000457.14,SCYL3,6.5428,5.4929,6.3809,5.0426,4.8642,4.0998,4.8043,4.8995,...,5.0076,7.0083,9.4073,6.7280,6.4974,5.4887,4.2106,7.6404,6.6323,7.0174
4,ENSG00000000460.17,C1orf112,5.9849,3.1369,5.7963,2.7663,6.9529,5.4879,2.8904,4.2395,...,3.2461,7.2234,6.9749,5.8939,3.9360,4.6042,3.8754,2.3827,5.9630,5.1545
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20432,ENSG00000288617.1,AL049777.1,8.4019,3.3954,10.8429,1.9661,6.7743,7.8791,8.4065,2.6509,...,1.1456,5.0177,19.0077,8.5134,3.1112,4.7478,2.7489,0.2630,13.6963,4.8007
20433,ENSG00000288658.1,AC010980.1,0.7025,0.6252,2.2429,0.7014,0.1727,0.4030,0.1769,0.8135,...,3.0511,0.3028,1.0637,0.7497,1.3470,1.4806,0.4309,3.6902,0.9034,1.9430
20434,ENSG00000288667.1,AC078856.1,0.0000,1.4481,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,...,0.0000,0.0000,1.6690,1.1050,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000
20435,ENSG00000288670.1,AL592295.6,12.7288,15.1055,12.0011,7.6533,6.0973,5.2496,7.5160,9.0069,...,11.1699,9.1788,15.8883,16.3763,10.7157,8.0572,6.9073,9.9713,13.6745,11.6404


In [17]:
# Displaying the sorted 'gene_expression_data_log2_transformed' DataFrame.
print("The 'gene_expression_data_log2_transformed' DataFrame after sorting its columns based on alphabetical order:")
gene_expression_data_log2_transformed

The 'gene_expression_data_log2_transformed' DataFrame after sorting its columns based on alphabetical order:


Unnamed: 0,Gene ID,Gene Name,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,...,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
0,ENSG00000000003.15,TSPAN6,6.281389,6.390969,5.700004,4.962429,6.149881,5.386263,6.371505,5.597546,...,4.661271,6.729633,5.985815,6.316540,6.837422,6.245553,6.067514,6.890070,6.175505,6.553814
1,ENSG00000000005.6,TNMD,0.489029,0.504773,0.213005,0.184471,0.366588,1.093695,0.180657,0.642193,...,0.494569,0.459012,0.427499,0.496718,1.815985,0.327802,0.270589,2.736735,0.295135,0.396927
2,ENSG00000000419.13,DPM1,6.621171,6.010155,6.631346,5.820404,6.351695,6.433978,6.340139,5.664173,...,6.235241,7.171345,6.904041,7.415961,6.120308,6.032791,6.496410,6.521268,6.414766,6.073630
3,ENSG00000000457.14,SCYL3,2.915100,2.698863,2.883797,2.595169,2.551934,2.350441,2.537122,2.560593,...,2.586789,3.001496,3.379524,2.950095,2.906390,2.697929,2.381450,3.111098,2.932118,3.003134
4,ENSG00000000460.17,C1orf112,2.804239,2.048550,2.764750,1.913148,2.991481,2.697752,1.959918,2.389429,...,2.086138,3.039735,2.995466,2.785320,2.303342,2.486508,2.285521,1.758175,2.799709,2.621642
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20432,ENSG00000288617.1,AL049777.1,3.232952,2.135994,3.565950,1.568567,2.958713,3.150413,3.233658,1.868252,...,1.101381,2.589212,4.322483,3.249961,2.039560,2.523010,1.906467,0.336855,3.877381,2.536227
20433,ENSG00000288658.1,AC010980.1,0.767655,0.700617,1.697285,0.766722,0.229834,0.488515,0.234992,0.858777,...,2.018314,0.381616,1.045233,0.807108,1.230818,1.310689,0.516923,2.229649,0.928579,1.557288
20434,ENSG00000288667.1,AC078856.1,0.000000,1.291662,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,1.416299,1.073820,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
20435,ENSG00000288670.1,AL592295.6,3.779134,4.009482,3.700562,3.113250,2.827270,2.643764,3.090176,3.322923,...,3.605245,3.347496,4.077952,4.119049,3.550371,3.179065,2.983185,3.455663,3.875239,3.659970
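
The printed values in the two tables above appear consistent with a log2(x + 1) transformation of the raw expression values. This is an inference from the displayed numbers, not something stated in this notebook. A minimal spot check, using three pairs copied from the first sample column and a hypothetical helper name `matches_log2_plus_one`, could look as follows:

```python
import math

# (raw value, log2-transformed value) pairs copied from the two tables above
# for the first sample column, TCGA-06-0125-01A-01.
pairs = [
    (76.7833, 6.281389),  # TSPAN6
    (0.4035, 0.489029),   # TNMD
    (0.0000, 0.000000),   # AC078856.1
]


def matches_log2_plus_one(raw, transformed, tolerance=1e-4):
    """Return True if `transformed` is close to log2(raw + 1).

    The tolerance is loose because the displayed raw values are rounded
    to four decimals.
    """
    return math.isclose(math.log2(raw + 1), transformed, abs_tol=tolerance)


all_match = all(matches_log2_plus_one(raw, transformed) for raw, transformed in pairs)
print(all_match)
```

A check on the complete DataFrames would have to compare all numeric columns; the three pairs here only illustrate the relationship.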


Now we can rerun the earlier check to verify that the cases indeed appear in the same order in all four DataFrames.

In [18]:
# Retrieving the columns of the four DataFrames. Since the first column of the 'methylation_data' and 
# 'methylation_data_M_transformed' DataFrames contains the CpG sites, it is excluded. Since the first column of the 
# 'gene_expression_data' and 'gene_expression_data_log2_transformed' DataFrames contains the id of the gene and the 
# second column contains the name of the gene, these two columns are excluded.
methylation_data_columns = methylation_data.columns[1:].tolist()
methylation_data_M_transformed_columns = methylation_data_M_transformed.columns[1:].tolist()
gene_expression_data_columns = gene_expression_data.columns[2:].tolist()
gene_expression_data_log2_transformed_columns = gene_expression_data_log2_transformed.columns[2:].tolist()

# Checking whether the case_id columns appear in the same order in the four DataFrames: 'methylation_data', 
# 'methylation_data_M_transformed', 'gene_expression_data', and 'gene_expression_data_log2_transformed'.
# Comparing the complete lists (instead of indexing up to the length of the first list) also detects a 
# difference in the number of cases, rather than raising an IndexError or silently passing.
mismatch_found = not (methylation_data_columns
                      == methylation_data_M_transformed_columns
                      == gene_expression_data_columns
                      == gene_expression_data_log2_transformed_columns)

if mismatch_found:
    print("The cases do not appear in the same order in all DataFrames.")
else:
    print("The cases appear in the same order in all DataFrames.")

The cases appear in the same order in all DataFrames.
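
The check above reports only whether a mismatch exists. When it fails, it helps to know which DataFrame deviates and at which position. The following is a minimal sketch of such a diagnostic; the function name `first_order_mismatch` and the toy column lists are assumptions for illustration and are not part of the original notebook.

```python
from typing import Dict, List, Optional, Tuple


def first_order_mismatch(
    columns_by_name: Dict[str, List[str]]
) -> Optional[Tuple[str, int]]:
    """Return (name, position) of the first deviation from the first list, or None."""
    names = list(columns_by_name)
    reference = columns_by_name[names[0]]
    for name in names[1:]:
        candidate = columns_by_name[name]
        if candidate == reference:
            continue
        # Find the first position at which the two lists differ.
        for position, (expected, found) in enumerate(zip(reference, candidate)):
            if expected != found:
                return name, position
        # No differing element: the lists differ only in length, so the
        # deviation is reported at the end of the shorter list.
        return name, min(len(reference), len(candidate))
    return None


# Toy example with two sample identifiers in swapped order.
columns = {
    "methylation_data": ["TCGA-06-0125-01A-01", "TCGA-06-0125-02A-11"],
    "gene_expression_data": ["TCGA-06-0125-02A-11", "TCGA-06-0125-01A-01"],
}
print(first_order_mismatch(columns))  # ('gene_expression_data', 0)
```

In the notebook, the dictionary would hold the four column lists retrieved above, with the first entry acting as the reference order.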


## Storing the Resulting Methylation and Gene Expression DataFrames

Now we can store the resulting methylation and gene expression DataFrames in the directory 'data_directory_combined_cleaned_files' by calling the method 'to_csv()' on each DataFrame. A file is only written when it does not exist yet, so an existing file is never overwritten.

#### Storing the 'methylation_data' DataFrame

In [19]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_combined_cleaned_files + "/methylation_data_cleaned_sorted.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    methylation_data.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

The file with the path C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/combined_cleaned_data/methylation_data_cleaned_sorted.csv already exists.


#### Storing the 'methylation_data_M_transformed' DataFrame

In [20]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_combined_cleaned_files + "/methylation_data_cleaned_M_transformed_sorted.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    methylation_data_M_transformed.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

The file with the path C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/combined_cleaned_data/methylation_data_cleaned_M_transformed_sorted.csv already exists.


#### Storing the 'gene_expression_data' DataFrame

In [21]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_combined_cleaned_files + "/gene_expression_data_cleaned_sorted.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    gene_expression_data.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

The file with the path C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/combined_cleaned_data/gene_expression_data_cleaned_sorted.csv has been created.


#### Storing the 'gene_expression_data_log2_transformed' DataFrame

In [22]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_combined_cleaned_files + "/gene_expression_data_cleaned_log2_transformed_sorted.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    gene_expression_data_log2_transformed.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

The file with the path C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/combined_cleaned_data/gene_expression_data_cleaned_log2_transformed_sorted.csv has been created.
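
The four storing cells above follow one pattern: build the path, check whether the file exists, and write it only if it does not. A minimal sketch of how this pattern could be wrapped in a single helper is given below. The name `save_if_absent` is hypothetical and not part of the original notebook. The helper assumes that `frame` is a pandas DataFrame, of which it uses only the `to_csv` method.

```python
import os


def save_if_absent(frame, directory, filename):
    """Write `frame` to directory/filename as CSV unless the file already exists.

    `frame` is expected to be a pandas DataFrame; only its `to_csv` method is used.
    Returns True when the file was written and False when it was left untouched.
    """
    file_to_save = os.path.join(directory, filename)
    if os.path.exists(file_to_save):
        print("The file with the path " + file_to_save + " already exists.")
        return False
    # index=False mirrors the cells above: the row index is not written to the file.
    frame.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
    return True
```

A call such as `save_if_absent(methylation_data, data_directory_combined_cleaned_files, "methylation_data_cleaned_sorted.csv")` would then replace one storing cell. Because existing files are skipped, a file left over from an earlier run is kept as it is; it has to be deleted first if the DataFrame has changed since.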
