# Transposing Final Methylation Data and Gene Expression Data Files
### Laurence Nickel (i6257119)

Libraries used: 
* pandas (version: '1.2.4')
* sys (Python standard library; Python version: '3.8.8')
* os (Python standard library; Python version: '3.8.8')

References:
* [1] scikit-learn (2023). Getting Started - scikit-learn 1.2.2. Available: https://scikit-learn.org/stable/getting_started.html (last accessed May 18, 2023).

## Introduction

Within this notebook, the methylation data and gene expression data files in the directory 'Bachelor Thesis Data/further_processed_datasets' are transposed so that the rows represent the samples and the columns represent the CpG sites/genes. The reason is that most machine learning libraries, including scikit-learn, which will be used extensively, expect the samples as rows and the features (CpG sites/genes) as columns [1]. This is the format expected by functions such as 'model.fit()'. Our data currently has the samples as columns and the CpG sites/genes as rows, so the data files have to be transposed. Doing this once in a separate notebook and storing the resulting files separately lets us load them into any later notebook without repeating the transposition, which costs both memory and time.
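As a minimal sketch of this change in orientation, the toy DataFrame below starts with features as rows and samples as columns and is then transposed. The values and the names `toy_features_by_samples` and `toy_samples_by_features` are made up for illustration and are not taken from the thesis data.

```python
import pandas as pd

# Toy data (hypothetical values): 3 CpG sites as rows, 2 samples as columns.
toy_features_by_samples = pd.DataFrame(
    {"sample_A": [0.90, 0.02, 0.61], "sample_B": [0.91, 0.69, 0.51]},
    index=["cg_1", "cg_2", "cg_3"],
)

# Transposing swaps the axes: samples become rows, features become columns.
toy_samples_by_features = toy_features_by_samples.T

print(toy_features_by_samples.shape)  # (n_features, n_samples)
print(toy_samples_by_features.shape)  # (n_samples, n_features)
```

After the transpose, each row holds one sample, which is the layout described in [1].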

This is done for the following files (located in the directory stored in the variable 'data_directory_further_processed_datasets', which is defined later in the notebook):
* 'methylation_data.csv' file
* 'methylation_data_M_transformed.csv' file
* 'gene_expression_data.csv' file
* 'gene_expression_data_log2_transformed.csv' file

### Importing libraries

Before we can start to transpose the methylation data and gene expression data files, we should first import some libraries that will be used throughout this notebook.

In [3]:
print("Starting the importing of the libraries...")


import pandas as pd
import sys
import os


print("Finishing the importing of the libraries.")

Starting the importing of the libraries...
Finishing the importing of the libraries.


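Now that all the libraries have been imported, we can verify that they have been loaded into this notebook by printing their version information. Because 'sys' and 'os' are part of the Python standard library, the version reported for 'sys' is the version of the Python interpreter itself.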

In [4]:
# Retrieving the version of the libraries to verify they have been correctly loaded into this notebook.
print("The library 'pd' (pandas) has been loaded into the notebook with its version being:")
print(pd.__version__)

print("\nThe library 'sys' has been loaded into the notebook with its version being:")
print(sys.version)

The library 'pd' (pandas) has been loaded into the notebook with its version being:
1.2.4

The library 'sys' has been loaded into the notebook with its version being:
3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
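A more compact version check could look like the sketch below. It assumes only that pandas is installed; the variable name `python_version` is chosen here for illustration.

```python
import sys

import pandas as pd

# 'sys' and 'os' ship with Python and carry no version number of their own;
# 'sys.version_info' describes the interpreter that provides them.
python_version = ".".join(str(part) for part in sys.version_info[:3])

print("Python:", python_version)
print("pandas:", pd.__version__)
```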


### Defining the data directories

In addition, we need to define the data directories from which the files are loaded and to which the resulting files are stored. Note that these paths must be changed to the corresponding directories on your own machine before running the notebook.

In [5]:
data_directory_further_processed_datasets = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/further_processed_datasets"
data_directory_final_datasets = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets"
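As an alternative to concatenating path pieces with "/", the paths can be assembled with 'os.path.join()'. The sketch below uses a placeholder base directory and hypothetical variable names (`example_base_directory`, `example_input_directory`, `example_input_file`); it does not touch the file system.

```python
import os

# Hypothetical base directory; replace it with the location of your own data.
example_base_directory = os.path.join("path", "to", "Bachelor Thesis Data")

# Sub-directory names as used in this notebook.
example_input_directory = os.path.join(example_base_directory, "further_processed_datasets")
example_output_directory = os.path.join(example_base_directory, "final_datasets")

# Joining directory and file name avoids inserting the separator by hand.
example_input_file = os.path.join(example_input_directory, "methylation_data.csv")
print(example_input_file)
```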

## Loading Methylation Data Files

Within this section, we can load the methylation data files into this notebook by calling the function 'pd.read_csv()'.
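To illustrate how 'pd.read_csv()' treats the identifier column, the sketch below reads a tiny in-memory CSV (made-up values, hypothetical names) in two ways: with the default integer index, as done in this notebook, and with the optional 'index_col' parameter.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for a features-by-samples CSV file (made-up values).
csv_text = "CpG sites,sample_A,sample_B\ncg_1,0.90,0.91\ncg_2,0.02,0.69\n"

# Default behaviour: a fresh integer index is created and 'CpG sites' is
# loaded as an ordinary column.
default_load = pd.read_csv(io.StringIO(csv_text))

# Alternative: use the 'CpG sites' column as the index while reading.
indexed_load = pd.read_csv(io.StringIO(csv_text), index_col="CpG sites")

print(default_load.columns.tolist())
print(indexed_load.index.tolist())
```

With the default load, the identifier column has to be turned into the index later via 'set_index()', which is what the transposing cells in this notebook do.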

#### Loading the 'methylation_data.csv' file into this notebook

In [6]:
# Loading the 'methylation_data.csv' file into this notebook by calling the function 'pd.read_csv()'.
methylation_data = pd.read_csv(data_directory_further_processed_datasets + "/" + "methylation_data.csv")

print("The 'methylation_data' DataFrame containing the data from the 'methylation_data.csv' file:")
methylation_data

The 'methylation_data' DataFrame containing the data from the 'methylation_data.csv' file:


Unnamed: 0,CpG sites,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,TCGA-06-0211-01A-01,...,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
0,cg00000957,0.897657,0.901618,0.943356,0.946300,0.900578,0.848567,0.910623,0.915881,0.944165,...,0.897398,0.862443,0.913197,0.906669,0.894053,0.914987,0.809969,0.899810,0.846634,0.890310
1,cg00001349,0.939643,0.934091,0.925107,0.892850,0.913929,0.911817,0.923782,0.905575,0.945036,...,0.935178,0.891986,0.901158,0.917937,0.941616,0.944935,0.934087,0.915504,0.925314,0.920439
2,cg00001583,0.022323,0.021567,0.688869,0.316700,0.039247,0.020791,0.017738,0.019118,0.512130,...,0.108900,0.061066,0.665138,0.710870,0.014355,0.453672,0.015485,0.014822,0.917137,0.128676
3,cg00002837,0.835868,0.721376,0.463914,0.468883,0.459649,0.327587,0.133814,0.186156,0.787476,...,0.455449,0.523737,0.906749,0.924157,0.452585,0.498965,0.559392,0.764733,0.600138,0.899721
4,cg00003287,0.609458,0.643082,0.510393,0.199371,0.385464,0.413962,0.098051,0.154888,0.178655,...,0.264349,0.588700,0.108046,0.073303,0.213188,0.369733,0.378124,0.036500,0.613448,0.057833
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270873,ch.22.707049R,0.030177,0.031269,0.038748,0.030588,0.023218,0.029380,0.021020,0.032773,0.023967,...,0.052957,0.069201,0.033329,0.122236,0.057525,0.098730,0.078825,0.179526,0.117766,0.083062
270874,ch.22.728807R,0.042784,0.086301,0.058598,0.084550,0.069550,0.076466,0.061619,0.094007,0.074757,...,0.142701,0.067897,0.082971,0.086607,0.101038,0.101680,0.063560,0.146808,0.064936,0.100809
270875,ch.22.734399R,0.259508,0.227487,0.201834,0.123179,0.169489,0.164743,0.290668,0.274596,0.205177,...,0.232739,0.336258,0.268942,0.243600,0.281208,0.369930,0.248358,0.280876,0.357379,0.276952
270876,ch.22.772318F,0.031644,0.031728,0.034949,0.024192,0.033715,0.032427,0.031951,0.038325,0.032142,...,0.090136,0.058802,0.046672,0.103855,0.077849,0.137996,0.051274,0.157513,0.050092,0.071710


#### Loading the 'methylation_data_M_transformed.csv' file into this notebook

In [7]:
# Loading the 'methylation_data_M_transformed.csv' file into this notebook by calling the function 'pd.read_csv()'.
methylation_data_M_transformed = pd.read_csv(data_directory_further_processed_datasets + "/" + "methylation_data_M_transformed.csv")

print("The 'methylation_data_M_transformed' DataFrame containing the data from the 'methylation_data_M_transformed.csv' file:")
methylation_data_M_transformed

The 'methylation_data_M_transformed' DataFrame containing the data from the 'methylation_data_M_transformed.csv' file:


Unnamed: 0,CpG sites,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,TCGA-06-0211-01A-01,...,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
0,cg00000957,3.132755,3.196057,4.057813,4.139295,3.179215,2.486346,3.348881,3.444665,4.079807,...,3.128688,2.648402,3.395115,3.280146,3.077022,3.427987,2.091628,3.166884,2.464759,3.020878
1,cg00001349,3.960518,3.825019,3.626717,3.058785,3.408476,3.370171,3.599340,3.261595,4.103812,...,3.850690,3.045809,3.188580,3.483593,4.011492,4.101014,3.824918,3.437609,3.631037,3.532181
2,cg00001583,-5.452737,-5.503606,1.146710,-1.109405,-4.613496,-5.557553,-5.791192,-5.681068,0.070015,...,-3.032589,-3.942593,0.990085,1.297866,-6.101468,-0.268118,-5.990499,-6.054600,3.468339,-2.759466
3,cg00002837,2.348422,1.372434,-0.208610,-0.179801,-0.233366,-1.037472,-2.694442,-2.128238,1.889610,...,-0.257780,0.137085,3.281515,3.607051,-0.274449,-0.005975,0.344366,1.700659,0.585791,3.165462
4,cg00003287,0.642046,0.849407,0.059986,-2.005682,-0.672902,-0.501496,-3.201449,-2.447921,-2.200817,...,-1.476577,0.517344,-3.045329,-3.660157,-1.883894,-0.769480,-0.717767,-4.722319,0.666281,-4.026015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270873,ch.22.707049R,-5.006189,-4.953307,-4.632719,-4.986055,-5.394690,-5.045978,-5.541445,-4.883268,-5.347784,...,-4.160536,-3.749599,-4.858180,-2.844166,-4.034204,-3.190405,-3.546754,-2.192265,-2.905236,-3.464562
270874,ch.22.728807R,-4.483695,-3.404266,-4.005875,-3.436608,-3.741805,-3.594284,-3.928725,-3.268656,-3.629545,...,-2.586804,-3.779062,-3.466283,-3.398678,-3.153368,-3.143200,-3.881002,-2.538937,-3.847968,-3.156996
270875,ch.22.734399R,-1.512707,-1.763778,-1.983516,-2.831527,-2.292808,-2.342003,-1.287092,-1.401476,-1.953764,...,-1.721006,-0.981054,-1.442693,-1.634639,-1.353937,-0.768262,-1.597628,-1.356304,-0.846516,-1.384458
270876,ch.22.772318F,-4.935511,-4.931599,-4.787276,-5.334004,-4.840999,-4.899087,-4.921159,-4.649189,-4.912269,...,-3.335483,-4.000571,-4.352346,-3.109168,-3.566246,-2.643066,-4.209691,-2.419188,-4.245142,-3.694331


## Loading Gene Expression Data Files

Within this section, we can load the gene expression data files into this notebook by calling the function 'pd.read_csv()'.

#### Loading the 'gene_expression_data.csv' file into this notebook

In [8]:
# Loading the 'gene_expression_data.csv' file into this notebook by calling the function 'pd.read_csv()'.
gene_expression_data = pd.read_csv(data_directory_further_processed_datasets + "/" + "gene_expression_data.csv")

print("The 'gene_expression_data' DataFrame containing the data from the 'gene_expression_data.csv' file:")
gene_expression_data

The 'gene_expression_data' DataFrame containing the data from the 'gene_expression_data.csv' file:


Unnamed: 0,Gene ID,Gene Name,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,...,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
0,ENSG00000000419,DPM1,97.4399,63.4521,98.1366,55.5088,80.6678,85.4610,80.0162,49.7091,...,74.3346,143.1418,118.7632,169.7759,68.5659,64.4713,89.2847,90.8538,84.3173,66.3511
1,ENSG00000000457,SCYL3,6.5428,5.4929,6.3809,5.0426,4.8642,4.0998,4.8043,4.8995,...,5.0076,7.0083,9.4073,6.7280,6.4974,5.4887,4.2106,7.6404,6.6323,7.0174
2,ENSG00000000460,C1orf112,5.9849,3.1369,5.7963,2.7663,6.9529,5.4879,2.8904,4.2395,...,3.2461,7.2234,6.9749,5.8939,3.9360,4.6042,3.8754,2.3827,5.9630,5.1545
3,ENSG00000000938,FGR,7.0651,16.4290,21.9912,65.6843,25.5317,11.9973,26.2095,17.0881,...,11.1232,36.0053,9.5427,3.8866,15.2813,11.1335,15.9864,23.5793,10.5260,13.9823
4,ENSG00000000971,CFH,16.9301,16.4273,19.3073,50.7959,24.9163,45.3018,12.3576,14.5879,...,17.2011,8.1233,4.4526,11.4120,22.3823,13.0230,6.9736,10.0796,9.3104,3.4071
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19626,ENSG00000288612,AL133351.4,5.5178,2.1185,3.5844,3.3664,2.6690,2.2014,1.9550,1.4535,...,7.9922,0.6027,4.2174,1.0777,3.2050,1.2770,2.0797,2.8821,9.1919,1.1323
19627,ENSG00000288658,AC010980.1,0.7025,0.6252,2.2429,0.7014,0.1727,0.4030,0.1769,0.8135,...,3.0511,0.3028,1.0637,0.7497,1.3470,1.4806,0.4309,3.6902,0.9034,1.9430
19628,ENSG00000288667,AC078856.1,0.0000,1.4481,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,...,0.0000,0.0000,1.6690,1.1050,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000
19629,ENSG00000288670,AL592295.6,12.7288,15.1055,12.0011,7.6533,6.0973,5.2496,7.5160,9.0069,...,11.1699,9.1788,15.8883,16.3763,10.7157,8.0572,6.9073,9.9713,13.6745,11.6404


#### Loading the 'gene_expression_data_log2_transformed.csv' file into this notebook

In [9]:
# Loading the 'gene_expression_data_log2_transformed.csv' file into this notebook by calling the function 'pd.read_csv()'.
gene_expression_data_log2_transformed = pd.read_csv(data_directory_further_processed_datasets + "/" + "gene_expression_data_log2_transformed.csv")

print("The 'gene_expression_data_log2_transformed' DataFrame containing the data from the 'gene_expression_data_log2_transformed.csv' file:")
gene_expression_data_log2_transformed

The 'gene_expression_data_log2_transformed' DataFrame containing the data from the 'gene_expression_data_log2_transformed.csv' file:


Unnamed: 0,Gene ID,Gene Name,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,...,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
0,ENSG00000000419,DPM1,6.621171,6.010155,6.631346,5.820404,6.351695,6.433978,6.340139,5.664173,...,6.235241,7.171345,6.904041,7.415961,6.120308,6.032791,6.496410,6.521268,6.414766,6.073630
1,ENSG00000000457,SCYL3,2.915100,2.698863,2.883797,2.595169,2.551934,2.350441,2.537122,2.560593,...,2.586789,3.001496,3.379524,2.950095,2.906390,2.697929,2.381450,3.111098,2.932118,3.003134
2,ENSG00000000460,C1orf112,2.804239,2.048550,2.764750,1.913148,2.991481,2.697752,1.959918,2.389429,...,2.086138,3.039735,2.995466,2.785320,2.303342,2.486508,2.285521,1.758175,2.799709,2.621642
3,ENSG00000000938,FGR,3.011692,4.123418,4.523010,6.059275,4.729645,3.700140,4.766039,4.176969,...,3.599699,5.209660,3.398172,2.288831,4.025144,3.600924,4.086308,4.619372,3.526820,3.905187
4,ENSG00000000971,CFH,4.164312,4.123277,4.343927,5.694766,4.695788,5.532996,3.739589,3.962355,...,4.185954,3.189556,2.446944,3.633664,4.547345,3.809723,2.995231,3.469834,3.366028,2.139830
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19626,ENSG00000288612,AL133351.4,2.704385,1.640852,2.196733,2.126444,1.875387,1.678703,1.563158,1.294841,...,3.168674,0.680504,2.383331,1.054987,2.072106,1.187134,1.622790,1.956837,3.349351,1.092410
19627,ENSG00000288658,AC010980.1,0.767655,0.700617,1.697285,0.766722,0.229834,0.488515,0.234992,0.858777,...,2.018314,0.381616,1.045233,0.807108,1.230818,1.310689,0.516923,2.229649,0.928579,1.557288
19628,ENSG00000288667,AC078856.1,0.000000,1.291662,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,1.416299,1.073820,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
19629,ENSG00000288670,AL592295.6,3.779134,4.009482,3.700562,3.113250,2.827270,2.643764,3.090176,3.322923,...,3.605245,3.347496,4.077952,4.119049,3.550371,3.179065,2.983185,3.455663,3.875239,3.659970


## Transposing Methylation and Gene Expression DataFrames

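Within this section, we transpose the methylation and gene expression DataFrames by calling 'transpose()' on each of them. Before transposing, the CpG site or gene identifiers are set as the row index so that they become the column names; afterwards, the sample names are moved from the index into a leading 'Samples' column.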
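The same sequence of steps is applied to all four DataFrames, so it could also be wrapped in a small helper. The function below is a sketch under that assumption; `transpose_to_samples_as_rows` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def transpose_to_samples_as_rows(data, id_column):
    """Return a new DataFrame with samples as rows and the IDs in 'id_column' as columns."""
    # Use the feature IDs as row labels and clear the axis name.
    indexed = data.set_index(id_column).rename_axis("")
    transposed = indexed.transpose()
    # Move the sample names from the index into a leading 'Samples' column.
    transposed.insert(0, "Samples", transposed.index)
    transposed.index = range(len(transposed))
    return transposed


# Toy input (made-up values): 2 CpG sites as rows, 2 samples as columns.
toy = pd.DataFrame(
    {"CpG sites": ["cg_1", "cg_2"], "sample_A": [0.90, 0.02], "sample_B": [0.91, 0.69]}
)
toy_T = transpose_to_samples_as_rows(toy, "CpG sites")
print(toy_T)
```

Because 'set_index()' returns a new DataFrame, the input is left unchanged and no explicit copy is needed inside the helper.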

### Transposing the Methylation DataFrames

#### Transposing the 'methylation_data' DataFrame

In [10]:
# Copying the 'methylation_data' DataFrame.
methylation_data_copy = methylation_data.copy()

# Calling the functions 'set_index()' and 'rename_axis()' to respectively set the CpG sites as the row labels of the
# 'methylation_data_copy' DataFrame and to clear the axis name, so that the label 'CpG sites' is not carried over to the
# transposed DataFrame, in which the rows represent the samples.
methylation_data_copy = methylation_data_copy.set_index('CpG sites')
methylation_data_copy = methylation_data_copy.rename_axis('')

# Transposing the 'methylation_data_copy' DataFrame by calling the function 'transpose()'.
methylation_data_T = methylation_data_copy.transpose()

# Setting the first column of the 'methylation_data_T' DataFrame to contain the names of the samples.
methylation_data_T.insert(0, 'Samples', methylation_data_T.index)
methylation_data_T.index = range(len(methylation_data_T))

print("The methylation data after transposing the DataFrame:")
methylation_data_T

The methylation data after transposing the DataFrame:


Unnamed: 0,Samples,cg00000957,cg00001349,cg00001583,cg00002837,cg00003287,cg00004121,cg00008647,cg00009292,cg00011717,...,ch.22.28920330F,ch.22.436090R,ch.22.441164F,ch.22.528917R,ch.22.569473R,ch.22.707049R,ch.22.728807R,ch.22.734399R,ch.22.772318F,ch.22.909671F
0,TCGA-06-0125-01A-01,0.897657,0.939643,0.022323,0.835868,0.609458,0.573148,0.077747,0.515607,0.962390,...,0.047403,0.247444,0.023274,0.048324,0.092222,0.030177,0.042784,0.259508,0.031644,0.041573
1,TCGA-06-0125-02A-11,0.901618,0.934091,0.021567,0.721376,0.643082,0.607448,0.092594,0.702981,0.981128,...,0.121478,0.232804,0.027033,0.044326,0.104435,0.031269,0.086301,0.227487,0.031728,0.072908
2,TCGA-06-0152-02A-01,0.943356,0.925107,0.688869,0.463914,0.510393,0.633310,0.342348,0.727519,0.981531,...,0.039210,0.176567,0.021798,0.031384,0.166394,0.038748,0.058598,0.201834,0.034949,0.066590
3,TCGA-06-0171-02A-11,0.946300,0.892850,0.316700,0.468883,0.199371,0.608267,0.051528,0.343695,0.987614,...,0.090471,0.110574,0.018514,0.059065,0.084249,0.030588,0.084550,0.123179,0.024192,0.048582
4,TCGA-06-0190-01A-01,0.900578,0.913929,0.039247,0.459649,0.385464,0.563823,0.256340,0.620945,0.960253,...,0.061767,0.136060,0.022281,0.045415,0.076932,0.023218,0.069550,0.169489,0.033715,0.060410
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,0.914987,0.944935,0.453672,0.498965,0.369733,0.650784,0.029887,0.702710,0.983409,...,0.044978,0.256822,0.090731,0.177181,0.083665,0.098730,0.101680,0.369930,0.137996,0.282453
60,TCGA-76-4928-01B-01,0.809969,0.934087,0.015485,0.559392,0.378124,0.573251,0.021810,0.563863,0.981317,...,0.023597,0.326470,0.043576,0.113330,0.072018,0.078825,0.063560,0.248358,0.051274,0.166855
61,TCGA-76-4929-01A-01,0.899810,0.915504,0.014822,0.764733,0.036500,0.588300,0.023903,0.747113,0.918559,...,0.051385,0.256155,0.062192,0.132334,0.089363,0.179526,0.146808,0.280876,0.157513,0.248242
62,TCGA-76-4931-01A-01,0.846634,0.925314,0.917137,0.600138,0.613448,0.647793,0.019436,0.912443,0.942751,...,0.035018,0.261144,0.050211,0.050398,0.076935,0.117766,0.064936,0.357379,0.050092,0.217461


#### Transposing the 'methylation_data_M_transformed' DataFrame

In [11]:
# Copying the 'methylation_data_M_transformed' DataFrame.
methylation_data_M_transformed_copy = methylation_data_M_transformed.copy()

# Calling the functions 'set_index()' and 'rename_axis()' to respectively set the CpG sites as the row labels of the
# 'methylation_data_M_transformed_copy' DataFrame and to clear the axis name, so that the label 'CpG sites' is not
# carried over to the transposed DataFrame, in which the rows represent the samples.
methylation_data_M_transformed_copy = methylation_data_M_transformed_copy.set_index('CpG sites')
methylation_data_M_transformed_copy = methylation_data_M_transformed_copy.rename_axis('')

# Transposing the 'methylation_data_M_transformed_copy' DataFrame by calling the function 'transpose()'.
methylation_data_M_transformed_T = methylation_data_M_transformed_copy.transpose()

# Setting the first column of the 'methylation_data_M_transformed_T' DataFrame to contain the names of the samples.
methylation_data_M_transformed_T.insert(0, 'Samples', methylation_data_M_transformed_T.index)
methylation_data_M_transformed_T.index = range(len(methylation_data_M_transformed_T))

print("The M-transformed methylation data after transposing the DataFrame:")
methylation_data_M_transformed_T

The M-transformed methylation data after transposing the DataFrame:


Unnamed: 0,Samples,cg00000957,cg00001349,cg00001583,cg00002837,cg00003287,cg00004121,cg00008647,cg00009292,cg00011717,...,ch.22.28920330F,ch.22.436090R,ch.22.441164F,ch.22.528917R,ch.22.569473R,ch.22.707049R,ch.22.728807R,ch.22.734399R,ch.22.772318F,ch.22.909671F
0,TCGA-06-0125-01A-01,3.132755,3.960518,-5.452737,2.348422,0.642046,0.425173,-3.568310,0.090094,4.677447,...,-4.328807,-1.604700,-5.391144,-4.299671,-3.299155,-5.006189,-4.483695,-1.512707,-4.935511,-4.526937
1,TCGA-06-0125-02A-11,3.196057,3.825019,-5.503606,1.372434,0.849407,0.629880,-3.292764,1.242929,5.700119,...,-2.854379,-1.720475,-5.169584,-4.430285,-3.100187,-4.953307,-3.404266,-1.763778,-4.931599,-3.668569
2,TCGA-06-0152-02A-01,4.057813,3.626717,1.146710,-0.208610,0.059986,0.788350,-0.941862,1.416831,5.731835,...,-4.614915,-2.221432,-5.487836,-4.947837,-2.324764,-4.632719,-4.005875,-1.983516,-4.787276,-3.809136
3,TCGA-06-0171-02A-11,4.139295,3.058785,-1.109405,-0.179801,-2.005682,0.634832,-4.202176,-0.933237,6.317184,...,-3.329587,-3.007868,-5.728251,-3.993713,-3.442227,-4.986055,-3.436608,-2.831527,-5.334004,-4.291577
4,TCGA-06-0190-01A-01,3.179215,3.408476,-4.613496,-0.233366,-0.672902,0.370326,-1.536587,0.712060,4.594485,...,-3.925038,-2.666686,-5.455511,-4.393648,-3.584784,-5.394690,-3.741805,-2.292808,-4.840999,-3.959180
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,3.427987,4.101014,-0.268118,-0.005975,-0.769480,0.898056,-5.020576,1.241059,5.889282,...,-4.408250,-1.532935,-3.325032,-2.215352,-3.453181,-3.190405,-3.143200,-0.768262,-2.643066,-1.345063
60,TCGA-76-4928-01B-01,2.091628,3.824918,-5.990499,0.344366,-0.717767,0.425782,-5.487022,0.370565,5.714939,...,-5.370823,-1.044793,-4.456032,-2.967873,-3.687673,-3.546754,-3.881002,-1.597628,-4.209691,-2.319970
61,TCGA-76-4929-01A-01,3.166884,3.437609,-6.054600,1.700659,-4.722319,0.514961,-5.351739,1.562834,3.495553,...,-4.206406,-1.537983,-3.914480,-2.712960,-3.349122,-2.192265,-2.538937,-1.356304,-2.419188,-1.598524
62,TCGA-76-4931-01A-01,2.464759,3.631037,3.468339,0.585791,0.666281,0.879107,-5.656808,3.381448,4.041543,...,-4.784318,-1.500444,-4.241542,-4.235881,-3.584725,-2.905236,-3.847968,-0.846516,-4.245142,-1.847410


### Transposing the Gene Expression DataFrames

#### Transposing the 'gene_expression_data' DataFrame

In [12]:
# Copying the 'gene_expression_data' DataFrame.
gene_expression_data_copy = gene_expression_data.copy()

# Calling the functions 'set_index()' and 'rename_axis()' to respectively set the gene IDs as the row labels of the
# 'gene_expression_data_copy' DataFrame and to clear the axis name, so that the label 'Gene ID' is not carried over to the
# transposed DataFrame, in which the rows represent the samples.
gene_expression_data_copy = gene_expression_data_copy.set_index('Gene ID')
gene_expression_data_copy = gene_expression_data_copy.rename_axis('')

# The column 'Gene Name' can be dropped because the gene IDs in the 'Gene ID' column already identify each record
# uniquely.
gene_expression_data_copy = gene_expression_data_copy.drop('Gene Name', axis=1)

# Transposing the 'gene_expression_data_copy' DataFrame by calling the function 'transpose()'.
gene_expression_data_T = gene_expression_data_copy.transpose()

# Setting the first column of the 'gene_expression_data_T' DataFrame to contain the names of the samples.
gene_expression_data_T.insert(0, 'Samples', gene_expression_data_T.index)
gene_expression_data_T.index = range(len(gene_expression_data_T))

print("The gene expression data after transposing the DataFrame:")
gene_expression_data_T

The gene expression data after transposing the DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,97.4399,6.5428,5.9849,7.0651,16.9301,63.7275,14.5107,24.2783,0.9795,...,9.1356,15.5325,5.4792,1.9232,7.2505,5.5178,0.7025,0.0000,12.7288,4.1649
1,TCGA-06-0125-02A-11,63.4521,5.4929,3.1369,16.4290,16.4273,66.9871,22.8885,18.1955,1.4115,...,6.0041,8.3656,4.2820,1.6080,2.8211,2.1185,0.6252,1.4481,15.1055,2.2084
2,TCGA-06-0152-02A-01,98.1366,6.3809,5.7963,21.9912,19.3073,73.1009,9.4840,24.5692,1.3706,...,9.1559,10.9328,4.6383,1.7929,6.1256,3.5844,2.2429,0.0000,12.0011,3.3086
3,TCGA-06-0171-02A-11,55.5088,5.0426,2.7663,65.6843,50.7959,65.8725,20.9938,22.3543,4.6538,...,3.5008,3.8793,2.8926,0.9067,2.7872,3.3664,0.7014,0.0000,7.6533,0.5195
4,TCGA-06-0190-01A-01,80.6678,4.8642,6.9529,25.5317,24.9163,101.1771,17.4425,28.6862,5.4792,...,4.4420,7.2659,3.2201,0.9027,4.5347,2.6690,0.1727,0.0000,6.0973,3.0503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,64.4713,5.4887,4.6042,11.1335,13.0230,79.9233,16.0202,25.2603,8.1108,...,11.8293,8.2600,2.7040,1.2529,3.3570,1.2770,1.4806,0.0000,8.0572,5.0802
60,TCGA-76-4928-01B-01,89.2847,4.2106,3.8754,15.9864,6.9736,84.9414,9.9544,19.4110,3.5411,...,7.1290,4.8864,2.6234,0.3575,4.1562,2.0797,0.4309,0.0000,6.9073,1.6309
61,TCGA-76-4929-01A-01,90.8538,7.6404,2.3827,23.5793,10.0796,29.7248,18.8568,47.1435,5.5205,...,9.7622,6.5039,4.7981,3.0880,4.4396,2.8821,3.6902,0.0000,9.9713,8.2106
62,TCGA-76-4931-01A-01,84.3173,6.6323,5.9630,10.5260,9.3104,28.4352,16.0662,27.2980,3.2969,...,9.5540,12.3694,8.1206,1.9674,8.9614,9.1919,0.9034,0.0000,13.6745,6.0827
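Dropping 'Gene Name' relies on the gene IDs being unique. A quick check along the following lines could confirm that before transposing; the sketch uses a made-up stand-in table and hypothetical variable names rather than the real data.

```python
import pandas as pd

# Made-up stand-in for the gene expression table.
toy_genes = pd.DataFrame(
    {
        "Gene ID": ["ENSG_1", "ENSG_2", "ENSG_3"],
        "Gene Name": ["GENE_A", "GENE_B", "GENE_B"],
        "sample_A": [1.0, 2.0, 3.0],
    }
)

# 'is_unique' is True only when no value occurs more than once.
ids_are_unique = toy_genes["Gene ID"].is_unique
names_are_unique = toy_genes["Gene Name"].is_unique

print(ids_are_unique, names_are_unique)
```

In this toy table the IDs are unique while one name is repeated, which is the kind of situation in which the ID column is the safer choice for column labels.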


#### Transposing the 'gene_expression_data_log2_transformed' DataFrame

In [13]:
# Copying the 'gene_expression_data_log2_transformed' DataFrame.
gene_expression_data_log2_transformed_copy = gene_expression_data_log2_transformed.copy()

# Calling the functions 'set_index()' and 'rename_axis()' to respectively set the gene IDs as the row labels of the
# 'gene_expression_data_log2_transformed_copy' DataFrame and to clear the axis name, so that the label 'Gene ID' is not
# carried over to the transposed DataFrame, in which the rows represent the samples.
gene_expression_data_log2_transformed_copy = gene_expression_data_log2_transformed_copy.set_index('Gene ID')
gene_expression_data_log2_transformed_copy = gene_expression_data_log2_transformed_copy.rename_axis('')

# The column 'Gene Name' can be dropped because the gene IDs in the 'Gene ID' column already identify each record
# uniquely.
gene_expression_data_log2_transformed_copy = gene_expression_data_log2_transformed_copy.drop('Gene Name', axis=1)

# Transposing the 'gene_expression_data_log2_transformed_copy' DataFrame by calling the function 'transpose()'.
gene_expression_data_log2_transformed_T = gene_expression_data_log2_transformed_copy.transpose()

# Setting the first column of the 'gene_expression_data_log2_transformed_T' DataFrame to contain the names of the samples.
gene_expression_data_log2_transformed_T.insert(0, 'Samples', gene_expression_data_log2_transformed_T.index)
gene_expression_data_log2_transformed_T.index = range(len(gene_expression_data_log2_transformed_T))

print("The log2-transformed gene expression data after transposing the DataFrame:")
gene_expression_data_log2_transformed_T

The log2-transformed gene expression data after transposing the DataFrame:


Unnamed: 0,Samples,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000288558,ENSG00000288559,ENSG00000288573,ENSG00000288586,ENSG00000288596,ENSG00000288612,ENSG00000288658,ENSG00000288667,ENSG00000288670,ENSG00000288675
0,TCGA-06-0125-01A-01,6.621171,2.915100,2.804239,3.011692,4.164312,6.016307,3.955192,4.659828,0.985136,...,3.341360,4.047233,2.695816,1.547549,3.044482,2.704385,0.767655,0.000000,3.779134,2.368740
1,TCGA-06-0125-02A-11,6.010155,2.698863,2.048550,4.123418,4.123277,6.087189,4.578244,4.262696,1.269931,...,2.808200,3.227371,2.401084,1.382944,1.933988,1.640852,0.700617,1.291662,4.009482,1.681854
2,TCGA-06-0152-02A-01,6.631346,2.883797,2.764750,4.523010,4.343927,6.211419,3.390117,4.676335,1.245252,...,3.344246,3.576861,2.495260,1.481764,2.833011,2.196733,1.697285,0.000000,3.700562,2.107219
3,TCGA-06-0171-02A-11,5.820404,2.595169,1.913148,6.059275,5.694766,6.063341,4.459025,4.545616,2.499221,...,2.170181,2.286674,1.960734,0.931078,1.921132,2.126444,0.766722,0.000000,3.113250,0.603597
4,TCGA-06-0190-01A-01,6.351695,2.551934,2.991481,4.729645,4.695788,6.674928,4.204962,4.891721,2.695816,...,2.444137,3.047172,2.077277,0.928048,2.468505,1.875387,0.229834,0.000000,2.827270,2.018029
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,TCGA-76-4927-01A-01,6.032791,2.697929,2.486508,3.600924,3.809723,6.338483,4.089176,4.714811,3.187578,...,3.681371,3.211012,1.889084,1.171783,2.123335,1.187134,1.310689,0.000000,3.179065,2.604119
60,TCGA-76-4928-01B-01,6.496410,2.381450,2.285521,4.086308,2.995231,6.425281,3.453439,4.351275,2.183042,...,3.023078,2.557386,1.857344,0.440952,2.366308,1.622790,0.516923,0.000000,2.983185,1.395556
61,TCGA-76-4929-01A-01,6.521268,3.111098,1.758175,4.619372,3.469834,4.941332,4.311561,5.589269,2.704983,...,3.427901,2.907641,2.535580,2.031395,2.443501,1.956837,2.229649,0.000000,3.455663,3.203295
62,TCGA-76-4931-01A-01,6.414766,2.932118,2.799709,3.526820,3.366028,4.879471,4.093070,4.822628,2.103296,...,3.399718,3.740863,3.189129,1.569199,3.316349,3.349351,0.928579,0.000000,3.875239,2.824299


## Storing the Resulting Methylation and Gene Expression DataFrames

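Now we can store the resulting methylation and gene expression DataFrames in the directory defined by 'data_directory_final_datasets' by calling the function 'to_csv()' on each DataFrame. Each file is only written if it does not already exist, so existing files are never overwritten.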
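The write-if-absent pattern used in the cells below can be sketched as a small helper. The function name `save_if_absent` and the toy data are hypothetical, and the example writes to a temporary directory rather than to the thesis data directories.

```python
import os
import tempfile

import pandas as pd


def save_if_absent(data, path):
    """Write 'data' to 'path' as CSV unless the file already exists; return True if written."""
    if os.path.exists(path):
        return False
    # 'index=False' omits the integer index, so 'Samples' is the first column.
    data.to_csv(path, index=False)
    return True


# Toy transposed table (made-up values).
toy_final = pd.DataFrame({"Samples": ["sample_A", "sample_B"], "cg_1": [0.90, 0.91]})

with tempfile.TemporaryDirectory() as temporary_directory:
    path = os.path.join(temporary_directory, "toy_final.csv")
    first_attempt = save_if_absent(toy_final, path)   # the file is created
    second_attempt = save_if_absent(toy_final, path)  # the file is left untouched
    reloaded = pd.read_csv(path)

print(first_attempt, second_attempt)
print(reloaded.columns.tolist())
```

Reading the file back, as done at the end of the sketch, is a simple way to confirm that the stored file has the samples as rows and no extra index column.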

#### Storing the 'methylation_data_T' DataFrame

In [14]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_final_datasets + "/methylation_data_final.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    methylation_data_T.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

The file with the path C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/methylation_data_final.csv has been created.


#### Storing the 'methylation_data_M_transformed_T' DataFrame

In [15]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_final_datasets + "/methylation_data_M_transformed_final.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    methylation_data_M_transformed_T.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

The file with the path C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/methylation_data_M_transformed_final.csv has been created.


#### Storing the 'gene_expression_data_T' DataFrame

In [16]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_final_datasets + "/gene_expression_data_final.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    gene_expression_data_T.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

The file with the path C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/gene_expression_data_final.csv has been created.


#### Storing the 'gene_expression_data_log2_transformed_T' DataFrame

In [17]:
# Defining where to save the resulting file and its name.
file_to_save = data_directory_final_datasets + "/gene_expression_data_log2_transformed_final.csv"

# Saving the file.
if not os.path.exists(file_to_save):
    gene_expression_data_log2_transformed_T.to_csv(file_to_save, index=False)
    print("The file with the path " + file_to_save + " has been created.")
else:
    print("The file with the path " + file_to_save + " already exists.")

The file with the path C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/final_datasets/gene_expression_data_log2_transformed_final.csv has been created.
