# Preparing transcriptomic integration datasets for deep learning

#### Because transcriptomic integration with iMAT is sensitive to the quartile, threshold and epsilon parameters, a range of parameter sets needs to be explored and their impact on prediction accuracy evaluated.

Different Settings:

| Simulation # | Upper Quartile | Lower Quartile | Epsilon | Threshold |
| --- | --- | --- | --- | --- |
| 1 | 0.1 | 0.9 | 1 | 0.1 | 
| 2 | 0.2 | 0.8 | 1 | 0.1 | 
| 3 | 0.3 | 0.7 | 1 | 0.1 | 
| 4 | 0.1 | 0.9 | 1 | 0.5 | 
| 5 | 0.2 | 0.8 | 1 | 0.5 | 
| 6 | 0.3 | 0.7 | 1 | 0.5 | 
| 7 | 0.1 | 0.9 | 10 | 1 | 
| 8 | 0.2 | 0.8 | 10 | 1 | 
| 9 | 0.3 | 0.7 | 10 | 1 | 
| 10 | 0.1 | 0.9 | 10 | 5 | 
| 11 | 0.2 | 0.8 | 10 | 5 | 
| 12 | 0.3 | 0.7 | 10 | 5 | 
| 13 | 0.1 | 0.9 | 50 | 5 | 
| 14 | 0.2 | 0.8 | 50 | 5 | 
| 15 | 0.3 | 0.7 | 50 | 5 | 
| 16 | 0.1 | 0.9 | 50 | 25 | 
| 17 | 0.2 | 0.8 | 50 | 25 | 
| 18 | 0.3 | 0.7 | 50 | 25 | 
| 19 | 0.1 | 0.9 | 100 | 10 | 
| 20 | 0.2 | 0.8 | 100 | 10 | 
| 21 | 0.3 | 0.7 | 100 | 10 | 
| 22 | 0.1 | 0.9 | 100 | 50 | 
| 23 | 0.2 | 0.8 | 100 | 50 | 
| 24 | 0.3 | 0.7 | 100 | 50 | 





Downloading and reading in the model

In [1]:
from pyGSLModel import download_GSL_model

model = download_GSL_model()

print(f"Number of Reactions in model : {len(model.reactions)}")
print(f"Number of Metabolites in model : {len(model.metabolites)}")
print(f"Number of Genes in model : {len(model.genes)}")

print("Checking gene symbol conversion :")
model.genes.get_by_id("UGT8")

Downloading  and Reading in Model
Model succesfully downloaded and read in.
Number of Reactions in model : 2312
Number of Metabolites in model : 2015
Number of Genes in model : 2887
Checking gene symbol conversion :


| Attribute | Value |
| --- | --- |
| Gene identifier | UGT8 |
| Name | G_UGT8 |
| Memory address | 0x1c65f0e8e10 |
| Functional | True |
| In 2 reaction(s) | MAR00920, MAR00919 |


Generating the list of parameters for automating simulations.

In [None]:
integration_params = [{'UQ':0.1, 'LQ':0.9, 'epsilon':1, 'threshold':0.1},
                        {'UQ':0.2, 'LQ':0.8, 'epsilon':1, 'threshold':0.1},
                        {'UQ':0.3, 'LQ':0.7, 'epsilon':1, 'threshold':0.1},
                        {'UQ':0.1, 'LQ':0.9, 'epsilon':1, 'threshold':0.5},
                        {'UQ':0.2, 'LQ':0.8, 'epsilon':1, 'threshold':0.5},
                        {'UQ':0.3, 'LQ':0.7, 'epsilon':1, 'threshold':0.5},
                        {'UQ':0.1, 'LQ':0.9, 'epsilon':10, 'threshold':1},
                        {'UQ':0.2, 'LQ':0.8, 'epsilon':10, 'threshold':1},
                        {'UQ':0.3, 'LQ':0.7, 'epsilon':10, 'threshold':1},
                        {'UQ':0.1, 'LQ':0.9, 'epsilon':10, 'threshold':5},
                        {'UQ':0.2, 'LQ':0.8, 'epsilon':10, 'threshold':5},
                        {'UQ':0.3, 'LQ':0.7, 'epsilon':10, 'threshold':5},
                        {'UQ':0.1, 'LQ':0.9, 'epsilon':50, 'threshold':5},
                        {'UQ':0.2, 'LQ':0.8, 'epsilon':50, 'threshold':5},
                        {'UQ':0.3, 'LQ':0.7, 'epsilon':50, 'threshold':5},
                        {'UQ':0.1, 'LQ':0.9, 'epsilon':50, 'threshold':25},
                        {'UQ':0.2, 'LQ':0.8, 'epsilon':50, 'threshold':25},
                        {'UQ':0.3, 'LQ':0.7, 'epsilon':50, 'threshold':25},
                        {'UQ':0.1, 'LQ':0.9, 'epsilon':100, 'threshold':10},
                        {'UQ':0.2, 'LQ':0.8, 'epsilon':100, 'threshold':10},
                        {'UQ':0.3, 'LQ':0.7, 'epsilon':100, 'threshold':10},
                        {'UQ':0.1, 'LQ':0.9, 'epsilon':100, 'threshold':50},
                        {'UQ':0.2, 'LQ':0.8, 'epsilon':100, 'threshold':50},
                        {'UQ':0.3, 'LQ':0.7, 'epsilon':100, 'threshold':50}]
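The 24 dictionaries above follow a regular pattern: each of the eight epsilon/threshold pairs is combined with each of the three quantile pairs. As a sketch, the same list could be generated rather than typed out. The names `quantile_pairs`, `eps_threshold_pairs` and `generated_params` are illustrative and not part of the original notebook.

```python
# Quantile pairs (UQ, LQ) and epsilon/threshold pairs, taken from the settings table.
quantile_pairs = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7)]
eps_threshold_pairs = [
    (1, 0.1), (1, 0.5),
    (10, 1), (10, 5),
    (50, 5), (50, 25),
    (100, 10), (100, 50),
]

# The outer loop runs over epsilon/threshold pairs and the inner loop over
# quantile pairs, which reproduces the simulation numbering 1 to 24.
generated_params = [
    {'UQ': uq, 'LQ': lq, 'epsilon': eps, 'threshold': thr}
    for eps, thr in eps_threshold_pairs
    for uq, lq in quantile_pairs
]

print(len(generated_params))
```

Generating the list this way keeps the table and the code in step if further parameter combinations are added later.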

Performing simulations for each set of parameters and saving each merged output dataframe as a CSV file

In [3]:
import pandas as pd
TCGA_df = pd.read_csv("./TCGA_Data/TCGA_LGG.csv")
TCGA_input = TCGA_df.set_index('sample').drop(columns=['OS.time', 'OS']).T.copy()
TCGA_input.head()

sample,TCGA-DB-5277-01,TCGA-TM-A84L-01,TCGA-TM-A7CF-02,TCGA-TM-A7CF-01,TCGA-FG-6691-01,TCGA-TQ-A7RK-02,TCGA-TQ-A7RK-01,TCGA-S9-A6TX-01,TCGA-DB-A64L-01,TCGA-TM-A7C3-01,TCGA-HT-7479-01,TCGA-DB-5279-01,TCGA-DB-A4X9-01,TCGA-FG-A4MT-02,TCGA-FG-A4MT-01,TCGA-FG-A70Y-01,TCGA-HT-A61C-01,TCGA-TM-A84Q-01,TCGA-QH-A6XC-01,TCGA-S9-A7QW-01,TCGA-QH-A870-01,TCGA-HT-7620-01,TCGA-QH-A6CX-01,TCGA-HT-7483-01,TCGA-S9-A6UB-01,TCGA-DB-5270-01,TCGA-DB-5273-01,TCGA-DB-5274-01,TCGA-DB-5276-01,TCGA-TM-A84F-01,TCGA-HT-7611-01,TCGA-DH-A7UR-01,TCGA-HT-7858-01,TCGA-TM-A7C5-01,TCGA-TM-A7C4-01,TCGA-DB-5275-01,TCGA-WY-A85B-01,TCGA-HT-7681-01,TCGA-WY-A858-01,TCGA-VM-A8C9-01,...,TCGA-RY-A843-01,TCGA-TQ-A7RI-01,TCGA-S9-A7J2-01,TCGA-QH-A6X9-01,TCGA-HT-A616-01,TCGA-HT-8010-01,TCGA-P5-A5F6-01,TCGA-HT-7680-01,TCGA-DU-A7TJ-01,TCGA-HT-7860-01,TCGA-HT-8107-01,TCGA-HT-7875-01,TCGA-CS-5394-01,TCGA-HT-7857-01,TCGA-HT-A4DS-01,TCGA-HT-8111-01,TCGA-P5-A5EV-01,TCGA-HT-7676-01,TCGA-HT-7471-01,TCGA-HT-7877-01,TCGA-HT-7691-01,TCGA-HT-8106-01,TCGA-HT-7687-01,TCGA-HT-7690-01,TCGA-HT-A74O-01,TCGA-VM-A8CB-01,TCGA-HT-A5RB-01,TCGA-HT-A5R9-01,TCGA-HT-7472-01,TCGA-P5-A5EU-01,TCGA-P5-A5ET-01,TCGA-E1-5322-01,TCGA-DU-6408-01,TCGA-DU-7007-01,TCGA-DU-6393-01,TCGA-DU-A7TG-01,TCGA-CS-6290-01,TCGA-DU-6406-01,TCGA-E1-A7YK-01,TCGA-R8-A6YH-01
A4GALT,2.015,2.348,1.385,0.4865,1.057,1.255,0.6145,1.832,1.245,2.409,2.625,1.444,0.679,1.633,0.9115,2.064,2.356,0.7664,2.521,0.2277,0.6332,1.221,1.05,2.008,0.6699,2.071,2.138,0.9493,2.683,2.826,1.105,2.744,1.434,1.202,0.9716,2.067,0.7579,1.749,2.457,3.978,...,1.345,1.628,2.036,2.118,0.6608,2.101,1.396,2.917,1.903,1.605,1.19,0.4657,0.9716,3.253,0.9493,1.848,1.799,0.6699,2.401,1.112,2.793,3.269,1.619,1.537,1.401,1.215,1.151,1.941,1.537,2.729,1.455,0.8082,1.036,1.428,1.561,2.698,1.864,1.522,3.529,0.547
ABO,-3.626,-1.685,-3.458,-2.466,-2.635,-2.351,-3.816,-1.938,-3.047,-2.053,-3.047,-1.595,-3.308,-2.635,-2.727,-3.308,-2.932,-4.293,-2.466,-2.932,-2.388,-1.283,-3.816,-1.831,-3.047,-3.458,-3.458,-3.171,-3.626,-2.245,-1.47,-1.117,-1.994,-1.318,-3.816,-0.9132,-3.171,-2.053,-2.114,-2.826,...,-3.626,-3.816,-2.635,-1.552,-3.047,-1.056,-4.035,-4.293,-2.315,0.1519,-1.938,-0.0877,-0.9686,-0.6416,-3.308,-4.293,-3.047,-2.635,-2.388,-1.47,-3.308,-1.994,-2.114,-3.308,-5.012,-3.047,-1.086,-2.635,-1.732,-3.626,-2.826,-3.308,-2.114,-2.932,-1.831,-1.086,-1.884,-2.727,-3.626,-2.826
B3GALNT1,4.098,2.478,3.074,3.798,3.819,1.647,1.669,2.824,4.593,3.843,3.739,3.962,4.422,4.546,3.739,3.943,4.692,2.144,3.977,3.66,4.348,3.965,3.289,4.277,2.439,4.545,3.609,2.511,3.853,3.702,3.831,2.322,3.055,1.546,4.843,3.641,1.86,4.646,2.359,2.245,...,3.106,3.957,3.991,3.999,4.773,4.413,1.496,4.685,4.982,5.26,4.644,4.218,3.835,3.921,3.236,4.099,3.911,4.309,4.64,3.916,4.162,3.487,4.352,4.685,1.779,3.511,4.415,3.609,4.542,5.091,3.601,4.413,4.807,3.512,4.288,4.735,3.269,5.004,2.398,3.954
B3GALT1,1.541,0.6425,0.4865,1.465,-0.1345,-0.1504,-0.7346,-2.245,1.373,2.098,-0.5543,1.915,2.328,1.938,-1.248,0.2277,1.221,0.4552,1.967,0.4761,1.434,-0.5125,0.3796,0.6332,-0.5756,1.907,-1.149,-1.149,1.368,1.774,-1.026,0.8883,-0.6193,-1.51,1.872,1.496,-0.8339,1.71,-1.355,0.03,...,0.8726,-0.1828,0.3796,2.683,2.251,1.808,-4.293,-1.831,3.183,2.556,2.519,1.956,-2.727,2.074,1.264,0.1124,-0.5125,-0.1993,-0.5332,0.9862,0.044,-0.8863,0.547,2.06,-3.047,2.224,1.158,0.346,0.7748,1.245,-0.6643,0.6332,0.7999,-0.6416,2.447,2.322,-0.1828,3.536,0.2029,-1.392
B3GALT4,1.008,2.39,2.196,1.043,2.519,1.571,0.6145,2.623,2.217,2.236,2.506,3.006,2.58,1.836,1.99,2.509,2.759,1.84,3.109,2.514,2.868,2.649,2.199,0.6425,2.874,2.337,2.412,2.514,2.995,2.82,2.602,2.676,2.022,1.896,1.328,2.025,1.971,2.749,2.0,3.788,...,2.199,2.193,2.986,2.233,1.401,2.806,2.376,3.256,2.105,2.559,2.387,1.975,2.725,3.687,3.268,2.658,2.281,2.509,3.106,2.514,3.131,3.657,2.808,2.196,2.658,3.084,1.071,1.491,2.504,2.793,2.772,2.774,1.001,2.642,2.938,2.585,2.592,3.424,3.519,1.125
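To make the reshaping step concrete, here is a minimal sketch on a toy table. It assumes, as the cell above implies, that the CSV has one row per sample with `sample`, `OS` and `OS.time` columns followed by one column per gene. The sample names `S1` to `S3` and the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the TCGA table: one row per sample,
# survival columns plus one column per gene (values are made up).
toy_df = pd.DataFrame({
    "sample": ["S1", "S2", "S3"],
    "OS": [1.0, 0.0, 0.0],
    "OS.time": [1547.0, 1242.0, 1989.0],
    "A4GALT": [2.0, 2.3, 1.4],
    "ABO": [-3.6, -1.7, -3.5],
})

# Same chain as in the notebook: index by sample, drop the survival
# columns, then transpose so genes become rows and samples become columns.
toy_input = toy_df.set_index("sample").drop(columns=["OS.time", "OS"]).T.copy()

print(toy_input.shape)
print(list(toy_input.index))
```

After the transpose, each row is a gene and each column is a sample, which matches the layout shown in the output above.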


In [5]:
TCGA_Surv = TCGA_df[['sample', 'OS', 'OS.time']].copy()
TCGA_Surv = TCGA_Surv.set_index('sample')
TCGA_Surv.head()

| sample | OS | OS.time |
| --- | --- | --- |
| TCGA-DB-5277-01 | 1.0 | 1547.0 |
| TCGA-TM-A84L-01 | 1.0 | 1242.0 |
| TCGA-TM-A7CF-02 | 0.0 | 1989.0 |
| TCGA-TM-A7CF-01 | 0.0 | 1989.0 |
| TCGA-FG-6691-01 | 0.0 | 1257.0 |
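The simulation loop that follows joins the expression data, the iMAT output and this survival table on the sample index using `pd.merge`. The sketch below illustrates that join on toy frames. It assumes the iMAT output is indexed by sample with one column per reaction; the frame `toy_flux`, the sample names and all values are hypothetical.

```python
import pandas as pd

samples = ["S1", "S2", "S3"]

# Samples as rows, genes as columns (the transposed expression input).
toy_expr = pd.DataFrame({"A4GALT": [2.0, 2.3, 1.4]}, index=samples)

# Hypothetical per-sample flux output; note that S3 is missing and S4 is extra.
toy_flux = pd.DataFrame({"MAR00920": [0.5, 0.0, 1.2]}, index=["S1", "S2", "S4"])

toy_surv = pd.DataFrame(
    {"OS": [1.0, 0.0, 0.0], "OS.time": [1547.0, 1242.0, 1989.0]},
    index=samples,
)

# pd.merge defaults to an inner join, so only samples present in
# every frame survive both merges.
merged = pd.merge(toy_expr, toy_flux, left_index=True, right_index=True)
merged = pd.merge(toy_surv, merged, left_index=True, right_index=True)

print(merged)
```

Because the join is an inner join, any sample without a matching row in one of the frames is dropped silently, so comparing the row count of the merged frame with the number of input samples is a cheap sanity check.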


In [None]:
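from pathlib import Path
from pyGSLModel import iMAT_multi_integrate

# Make sure the output folder exists before any CSV file is written
Path("./iMAT_integrated_data").mkdir(parents=True, exist_ok=True)

# Number the parameter sets from 1 so file names match the settings table
for i, params in enumerate(integration_params, start=1):
    # Run the iMAT integration for every sample with this parameter set
    iMAT_simulated = iMAT_multi_integrate(model=model,
                                          data=TCGA_input,
                                          upper_quantile=params["UQ"],
                                          lower_quantile=params["LQ"],
                                          epsilon=params["epsilon"],
                                          threshold=params["threshold"])
    # The first element of the returned object is used as the results dataframe
    iMAT_df = iMAT_simulated[0]
    # Join expression data (samples as rows) with the iMAT output on the sample index
    TCGA_Tidied_Data = pd.merge(TCGA_input.T, iMAT_df, left_index=True, right_index=True)
    # Add the survival columns, again matching on the sample index
    TCGA_Tidied_Data = pd.merge(TCGA_Surv, TCGA_Tidied_Data, right_index=True, left_index=True)
    TCGA_Tidied_Data.to_csv(f"./iMAT_integrated_data/TCGA_iMAT_integrated_df_{i}.csv")
    print(f"\n--------------------------\nParameters set {i} completed\n--------------------------\n")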

Simulations Performed:1/523
Simulations Performed:2/523
Simulations Performed:3/523
Simulations Performed:4/523
Simulations Performed:5/523
Simulations Performed:6/523
Simulations Performed:7/523
Simulations Performed:8/523
Simulations Performed:9/523
Simulations Performed:10/523
Simulations Performed:11/523
Simulations Performed:12/523
Simulations Performed:13/523
Simulations Performed:14/523
Simulations Performed:15/523
Simulations Performed:16/523
Simulations Performed:17/523
Simulations Performed:18/523
Simulations Performed:19/523
Simulations Performed:20/523
Simulations Performed:21/523
Simulations Performed:22/523
Simulations Performed:23/523
Simulations Performed:24/523
Simulations Performed:25/523
Simulations Performed:26/523
Simulations Performed:27/523
Simulations Performed:28/523
Simulations Performed:29/523
Simulations Performed:30/523
Simulations Performed:31/523
Simulations Performed:32/523
Simulations Performed:33/523
Simulations Performed:34/523
Simulations Performed:3