## Data Preparation 

We use the BEELINE benchmark to evaluate the performance of DeepSEM.
The data preparation process is as follows:
1. Download the raw data from https://doi.org/10.5281/zenodo.3378975, which is provided by the BEELINE benchmark.
2. Use the preprocessing code at https://github.com/Murali-group/Beeline/blob/master/generateExpInputs.py to generate the datasets.

We also provide demo data in ../demo_data/GRN_inference/input.

# Run DeepSEM with the following commands:
For the cell type non-specific GRN inference task: python main.py --task non_celltype_GRN --data_file demo_data/GRN_inference/input/500_STRING_hESC/data.csv --net_file demo_data/GRN_inference/input/500_STRING_hESC/label.csv --setting new --alpha 100 --beta 1 --n_epoch 90 --save_name out

For the cell type specific GRN inference task: python main.py --task celltype_GRN --data_file demo_data/GRN_inference/input/500_ChIP-seq_hESC/data.csv --net_file demo_data/GRN_inference/input/500_ChIP-seq_hESC/label.csv --setting new --alpha 0.1 --beta 0.01 --n_epochs 150 --save_name out

In [None]:
! python main.py --task non_celltype_GRN --data_file demo_data/GRN_inference/input/500_STRING_hESC/data.csv --net_file demo_data/GRN_inference/input/500_STRING_hESC/label.csv --setting new --alpha 100 --beta 1 --n_epoch 90 --save_name out
! python main.py --task celltype_GRN --data_file demo_data/GRN_inference/input/500_ChIP-seq_hESC/data.csv --net_file demo_data/GRN_inference/input/500_ChIP-seq_hESC/label.csv --setting new --alpha 0.1 --beta 0.01 --n_epochs 150 --save_name out

In [6]:
! python D:/Bachelor/MinorGraduationProject/CSIPSystem/CSIP/DeepSEM/main.py --task non_celltype_GRN --data_file D:/Bachelor/MinorGraduationProject/CSIPSystem/CSIP/DeepSEM/demo_data/GRN_inference/input/500_STRING_hESC/data.csv --net_file D:/Bachelor/MinorGraduationProject/CSIPSystem/CSIP/DeepSEM/demo_data/GRN_inference/input/500_STRING_hESC/label.csv --setting new --alpha 100 --beta 1 --n_epoch 90 --save_name out

dir exist
epoch: 1 Ep: 107 Epr: 1.0450096454324784 loss: 3.3586705327033997 mse_loss: 0.9961295127868652 kl_loss: 2.3071235122624785 sparse_loss: 0.05541753148039182
epoch: 2 Ep: 104 Epr: 1.0157103095792313 loss: 3.3077706694602966 mse_loss: 0.9957329382499059 kl_loss: 2.3087346121125543 sparse_loss: 0.0033031272857139506
epoch: 4 Ep: 147 Epr: 1.435667456809106 loss: 3.2259697914123535 mse_loss: 0.8080269147952398 kl_loss: 2.4143110308796167 sparse_loss: 0.003631916013546288
epoch: 5 Ep: 158 Epr: 1.5430983549376784 loss: 3.2243102391560874 mse_loss: 0.8066638261079788 kl_loss: 2.414142973100146 sparse_loss: 0.00350345221037666
epoch: 7 Ep: 192 Epr: 1.8751574946078118 loss: 3.1519784132639566 mse_loss: 0.5288765281438828 kl_loss: 2.6165871769189835 sparse_loss: 0.006514674654075255
epoch: 8 Ep: 205 Epr: 2.0021212833052155 loss: 3.149428923924764 mse_loss: 0.5274904171625773 kl_loss: 2.6120044166843095 sparse_loss: 0.009934067182863751
epoch: 10 Ep: 211 Epr: 2.06071995501171 loss: 3.1452
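
The training log above prints one line of `key: value` pairs per reported epoch. To track how `Epr` or the loss terms evolve, those lines can be parsed into dictionaries. The following is a minimal sketch that assumes the log format stays exactly as shown; `parse_log_line` is a hypothetical helper and not part of DeepSEM.

```python
import re

# Matches "name: number" pairs such as "Epr: 1.045" or "mse_loss: 0.996".
# The format is assumed from the log lines shown above.
LOG_FIELD = re.compile(r"(\w+): ([-+0-9.eE]+)")


def parse_log_line(line):
    """Turn one training log line into a dict mapping field name to float."""
    return {key: float(value) for key, value in LOG_FIELD.findall(line)}


example_line = (
    "epoch: 2 Ep: 104 Epr: 1.0157103095792313 "
    "loss: 3.3077706694602966 mse_loss: 0.9957329382499059"
)
parsed = parse_log_line(example_line)
print(parsed["epoch"], parsed["Ep"], parsed["Epr"])
```

Lines without any `key: value` pair, such as `dir exist`, yield an empty dictionary and can be skipped.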

# Calculate EPR values

In [15]:
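import pandas as pd

# Load the predicted edges and rank them by absolute edge weight
output = pd.read_csv('../demo_data/GRN_inference/output/500_STRING_hESC_demo_output.tsv', sep='\t')
output['EdgeWeight'] = abs(output['EdgeWeight'])
output = output.sort_values('EdgeWeight', ascending=False)

# Load the ground-truth network; TFs are the regulators, Genes are all nodes
label = pd.read_csv('../demo_data/GRN_inference/input/500_STRING_hESC/label.csv')
TFs = set(label['Gene1'])
Genes = set(label['Gene1']) | set(label['Gene2'])

# Keep only predicted edges that go from a known TF to a known gene
output = output[output['Gene1'].apply(lambda x: x in TFs)]
output = output[output['Gene2'].apply(lambda x: x in Genes)]

# Take the top-k predictions, where k is the number of ground-truth edges
label_set = set(label['Gene1'] + '|' + label['Gene2'])
output = output.iloc[:len(label_set)]

# EPR: fraction of true edges among the top k, divided by the edge density of the ground-truth network
len(set(output['Gene1'] + '|' + output['Gene2']) & label_set) / (len(label_set)**2 / (len(TFs) * len(Genes) - len(TFs)))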


4.12143991002342
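
The EPR cell above divides the early precision (the fraction of ground-truth edges among the top-k predictions, with k equal to the number of ground-truth edges) by the precision expected from a random predictor, which is the edge density k / (|TFs| * |Genes| - |TFs|). The sketch below restates that computation on toy data using only the standard library. The function name `early_precision_ratio` and the toy edges are illustrative assumptions, not part of DeepSEM or BEELINE.

```python
def early_precision_ratio(predictions, true_edges):
    """predictions: iterable of (tf, target, weight); true_edges: set of (tf, target)."""
    tfs = {tf for tf, _ in true_edges}
    genes = tfs | {target for _, target in true_edges}

    # Keep only predicted edges from a known TF to a known gene,
    # then rank them by absolute weight (largest first).
    candidates = [(tf, tg, abs(w)) for tf, tg, w in predictions
                  if tf in tfs and tg in genes]
    candidates.sort(key=lambda edge: edge[2], reverse=True)

    k = len(true_edges)
    top_k = {(tf, tg) for tf, tg, _ in candidates[:k]}
    early_precision = len(top_k & true_edges) / k

    # Precision of a random predictor: density of true edges among all
    # possible TF -> gene pairs, excluding self-loops.
    density = k / (len(tfs) * len(genes) - len(tfs))
    return early_precision / density


# Toy data (hypothetical): 3 true edges, 2 TFs, 4 genes.
toy_true_edges = {("A", "B"), ("A", "C"), ("D", "B")}
toy_predictions = [("A", "B", 0.9), ("D", "C", -0.8), ("A", "C", 0.7),
                   ("D", "B", 0.1), ("B", "A", 1.0)]
toy_epr = early_precision_ratio(toy_predictions, toy_true_edges)
print(toy_epr)
```

In the toy case, two of the top three predictions are true edges (early precision 2/3) and the density is 3/6, so the ratio is 4/3. A value above 1 means the ranking does better than random.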

# Calculate AUPR ratio values

In [3]:
from sklearn.metrics import average_precision_score
import numpy as np
import pandas as pd

# Load the predicted edges and use the absolute edge weight as the score
output = pd.read_csv('../demo_data/GRN_inference/output/500_STRING_hESC_demo_output.tsv', sep='\t')
output['EdgeWeight'] = abs(output['EdgeWeight'])
output = output.sort_values('EdgeWeight', ascending=False)

# Load the ground-truth network; TFs are the regulators, Genes are all nodes
label = pd.read_csv('../demo_data/GRN_inference/input/500_STRING_hESC/label.csv')
TFs = set(label['Gene1'])
Genes = set(label['Gene1']) | set(label['Gene2'])

# Keep only predicted edges that go from a known TF to a known gene
output = output[output['Gene1'].apply(lambda x: x in TFs)]
output = output[output['Gene2'].apply(lambda x: x in Genes)]

# Use 'Gene1|Gene2' keys so that distinct gene pairs cannot collide
label_set = set(label['Gene1'] + '|' + label['Gene2'])
res_d = {}
for item in output.to_dict('records'):
    res_d[item['Gene1'] + '|' + item['Gene2']] = item['EdgeWeight']

# Build the label vector l and score vector p over all TF x gene pairs
l = []
p = []
for item in TFs:
    for item2 in Genes:
        key = item + '|' + item2
        l.append(1 if key in label_set else 0)
        # Pairs without a prediction get a score below every absolute weight
        p.append(res_d.get(key, -1))

# AUPR ratio: average precision divided by the fraction of positive pairs
average_precision_score(l, p) / np.mean(l)

2.052970499172538
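
The AUPR ratio cell above scores every TF-gene pair, assigns -1 to pairs without a prediction, and divides the average precision by the fraction of positive pairs (the expected value for a random predictor). The following sketch reproduces that logic on toy data with the standard library only. `average_precision` is a hand-written stand-in that follows the step-wise definition AP = Σ (R_n − R_{n−1}) · P_n with tied scores grouped together; it is an assumption that this matches `average_precision_score` on real data, so the scikit-learn function remains the reference. All names and the toy edges are hypothetical.

```python
def average_precision(labels, scores):
    """Step-wise average precision; tied scores share one threshold."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ap, tp, prev_recall, i = 0.0, 0, 0.0, 0
    while i < len(order):
        j = i
        # Consume every item that shares the current score.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            tp += labels[order[j]]
            j += 1
        recall, precision = tp / n_pos, tp / j
        ap += (recall - prev_recall) * precision
        prev_recall, i = recall, j
    return ap


def aupr_ratio(predictions, true_edges):
    """predictions: iterable of (tf, target, weight); true_edges: set of (tf, target)."""
    tfs = {tf for tf, _ in true_edges}
    genes = tfs | {target for _, target in true_edges}
    score_by_edge = {(tf, tg): abs(w) for tf, tg, w in predictions}
    labels, scores = [], []
    for tf in sorted(tfs):
        for gene in sorted(genes):
            labels.append(1 if (tf, gene) in true_edges else 0)
            # Unpredicted pairs rank below every absolute weight.
            scores.append(score_by_edge.get((tf, gene), -1.0))
    baseline = sum(labels) / len(labels)
    return average_precision(labels, scores) / baseline


# Toy data (hypothetical): 3 true edges among 2 TFs x 4 genes = 8 pairs.
toy_true_edges = {("A", "B"), ("A", "C"), ("D", "B")}
toy_predictions = [("A", "B", 0.9), ("D", "C", -0.8), ("A", "C", 0.7),
                   ("D", "B", 0.1), ("B", "A", 1.0)]
toy_ratio = aupr_ratio(toy_predictions, toy_true_edges)
print(toy_ratio)
```

Here the average precision is 29/36 and the baseline is 3/8, giving a ratio of 58/27 (about 2.15).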

# Ensemble DeepSEM result

In [None]:
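import pandas as pd

# Load the predictions of 10 independent runs.
# Assumption: the per-run files are named rep_0.csv ... rep_9.csv; adjust the pattern if they differ.
res = []
for i in range(10):
    res.append(pd.read_csv(f'../../scGRN/Upload/GRN_inference_benchmark/cross_validation/500_STRING_hESC/rep_{i}.csv', sep='\t'))
res = pd.concat(res)
# Average the absolute edge weight of each (Gene1, Gene2) pair across runs
res['EdgeWeight'] = abs(res['EdgeWeight'])
res.groupby(['Gene1', 'Gene2']).mean()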
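
The ensemble cell above concatenates the predictions of several runs and averages the absolute edge weight of each gene pair. The sketch below shows the same averaging on toy data with the standard library. `ensemble_edge_weights` and the toy runs are hypothetical and only illustrate the grouping step.

```python
from collections import defaultdict


def ensemble_edge_weights(runs):
    """runs: list of runs, each a list of (gene1, gene2, weight) tuples.

    Returns the mean absolute weight per (gene1, gene2) pair, averaged over
    the runs in which that pair appears (the same behaviour as a group-wise mean).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for run in runs:
        for gene1, gene2, weight in run:
            sums[(gene1, gene2)] += abs(weight)
            counts[(gene1, gene2)] += 1
    return {edge: sums[edge] / counts[edge] for edge in sums}


# Toy data (hypothetical): two runs with partially overlapping edges.
toy_runs = [
    [("A", "B", 0.5), ("A", "C", -0.25)],
    [("A", "B", -1.5)],
]
toy_ensemble = ensemble_edge_weights(toy_runs)
print(toy_ensemble)
```

An edge missing from a run is not counted as zero; it is averaged only over the runs that report it, which is also what `groupby(...).mean()` does on the concatenated table.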