## Instructions

This Jupyter notebook runs the MADRID pipeline to identify drug targets and repurposable drugs for user-defined complex human diseases. The process consists of four steps:
1. Download and analyze transcriptomics and proteomics data, and output a list of active genes.
2. Create tissue-specific models based on the list of active genes. If required, the user can manually refine these models and supply them in Step 4.
3. Identify differentially expressed genes from disease datasets.
4. Identify drug targets and repurposable drugs. This step consists of four substeps:
 (i) map drugs onto automatically created or user-supplied models, (ii) run knock-out simulations, (iii) compare simulation results of perturbed and unperturbed models, and (iv) integrate with disease genes and score drug targets.

Users need to create the input files for each step, upload them to the docker container at `/root/pipelines/data/`, and specify them in this notebook. The original docker image includes example input files for building metabolic models of naive, Th1, Th2, and Th17 subtypes and identifying drug targets for rheumatoid arthritis. Follow the documentation and the format of the example input files to create your own input files.
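
Before running a step, it can help to confirm that every file named in the notebook is actually present in the data directory. The following is a minimal sketch, not part of the pipeline: the helper name `find_missing_inputs` is hypothetical, and the file list in the usage lines is only illustrative.

```python
import os

def find_missing_inputs(data_dir, filenames):
    """Return the names in `filenames` that are not regular files in `data_dir`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(data_dir, name))]

# Illustrative usage: the directory is the one named in the instructions,
# the file names are examples taken from later cells of this notebook.
missing = find_missing_inputs('/root/pipelines/data',
                              ['transcriptomics_data_inputs.xlsx',
                               'proteomics_data_inputs.xlsx'])
if missing:
    print('Missing input files:', ', '.join(missing))
```

An empty result means all listed files were found; anything else should be uploaded before the corresponding step is run.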

In [2]:
# import necessary python packages
import sys
import os
import pandas
import numpy
import json
import re
from subprocess import call
from project import configs


# print root path of the project
print(configs.rootdir) 

/home/jupyteruser/work


## Step 1: Identifying gene activity by analyzing transcriptomics and proteomics datasets

*** Specify input files for step 1 here ***

If proteomics data is not available, use:

proteomics_data_file = 'dummy_proteomics_data.xlsx'

proteomics_config_file = 'dummy_proteomics_config.xlsx'

In [3]:
# Step 0: Preprocess bulk RNA-seq data by generating a count matrix from the gene counts files
# produced by STAR and/or fetching necessary gene info from BioDBnet

technique = "quantile" # technique for bulk RNA-seq active gene determination
                        # for count matrix gen, only used to determine whether or not
                        # picard output mean fragment sizes are required.

tissue_name = "NaiveB"
create_counts_matrix = True # set to false if using a pregenerated matrix file
gene_format = "Ensembl" # accepts 'Entrez', 'Ensembl', and 'Symbol'
    
cmd = ' '.join(['python3', 'bulkRNAPreprocess.py',
                '-n', '"{}"'.format(tissue_name),
                '-c', '"{}"'.format(create_counts_matrix),
                '-f', '"{}"'.format(gene_format),
                '-t', '"{}"'.format(technique)])
!{cmd}

Creating directory /home/jupyteruser/.config/bioservices 
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
['bulkRNAPreprocess.py', '-n', 'NaiveB', '-c', 'True', '-f', 'Ensembl', '-t', 'quantile']
Input directory is "/home/jupyteruser/work/data/bulkData/NaiveB"
Output directory is "/home/jupyteruser/work/data"
Active gene determination technique is "quantile"
Creating Counts Matrix
Count Matrix written at  /home/jupyteruser/work/data/BulkRNAseqDataMatrix_NaiveB.csv 
Fetching gene info using genes in "/home/jupyteruser/work/data/BulkRNAseqDataMatrix_NaiveB.csv"
Creating directory /home/jupyteruser/.cache/bioservices 
Welcome to Bioservices
It looks like you do not have a configuration file.
We are creating one with default values in /home/jupyteruser/.config/bioservices/bioservices.cfg .
Done
retrieve 0:500
retrieve 500:1000
retrieve 1000:1500
retrieve 1500:2000
retrieve 2000:2500
retri
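
The cells in this notebook build each command line by joining manually quoted strings. As a hedged alternative, the standard library's `shlex.join` quotes only the arguments that need it, which guards against values containing spaces or quote characters. The helper name `build_command` is hypothetical; the script name and flags are the ones used in the cell above.

```python
import shlex

def build_command(script, options):
    """Build a shell-safe command line from a script name and a flag -> value mapping."""
    args = ['python3', script]
    for flag, value in options.items():
        args += [flag, str(value)]
    return shlex.join(args)  # quotes only the arguments that need quoting

cmd = build_command('bulkRNAPreprocess.py',
                    {'-n': 'NaiveB', '-c': True, '-f': 'Ensembl', '-t': 'quantile'})
print(cmd)
# python3 bulkRNAPreprocess.py -n NaiveB -c True -f Ensembl -t quantile
```

The resulting string can be passed to `!{cmd}` exactly as before. `shlex.join` requires Python 3.8 or later.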

In [4]:
# wd for development

# Specify input files for step 1

# config file for transcriptomics (microarray)
transcriptomics_config_file = 'transcriptomics_data_inputs.xlsx'

# data file for bulk rna-seq
bulk_data_file = 'BulkRNAseqDataMatrix_NaiveB.csv'

# config for bulk rna-seq
bulk_config_file = 'bulk_data_inputs_test.csv'

# gene info file for bulk rna-seq
gene_info_file = 'GeneInfo_NaiveB.csv'

# data file for proteomics
proteomics_data_file = 'ProteomicsDataMatrix.xlsx' 

# config file for proteomics
proteomics_config_file = 'proteomics_data_inputs.xlsx'

# ratio of replicates required for a gene to be considered active in that sample
expression_proportion = 0.5

# Genes can be considered high confidence (labeled as 'top') if they are expressed
# in a high proportion of samples. High confidence genes will be considered expressed
# regardless of agreement with other data sources
top_proportion = 0.9
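
To make the two thresholds concrete, the sketch below classifies a single gene from per-replicate expression calls. It is an illustration only: the function name `classify_gene` is hypothetical, and the use of `>=` at the threshold is an assumption, since the pipeline scripts may treat the boundary differently.

```python
def classify_gene(expressed_flags, expression_proportion=0.5, top_proportion=0.9):
    """Classify one gene from per-replicate booleans (True = expressed in that replicate)."""
    fraction = sum(expressed_flags) / len(expressed_flags)
    return {
        'active': fraction >= expression_proportion,  # assumed inclusive threshold
        'top': fraction >= top_proportion,            # high-confidence gene
    }

print(classify_gene([True, True, False, False]))  # expressed in 2 of 4 replicates
print(classify_gene([True] * 10))                 # expressed in all replicates
```

With the values set above, a gene expressed in half of the replicates is active but not high confidence, while a gene expressed in every replicate is both.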

In [10]:
# Step 1.1 Download and analyze transcriptomics
cmd = ' '.join(['python3', 'transcriptomic_gen.py', 
      '-i', '"{}"'.format(transcriptomics_config_file),
      '-e', '"{}"'.format(expression_proportion),
      '-t', '"{}"'.format(top_proportion)])
!{cmd}

Input file is  transcriptomics_data_inputs.xlsx
Expression Proportion for Gene Expression is  0.5
Top proportion for high-confidence genes is  0.9
/home/jupyteruser/work/data/transcriptomics_data_inputs.xlsx
---
Start Collecting Data for:
['GSE22886' 'GSE43005' 'GSE22045' 'GSE24634']
['GSM565273' 'GSM565274' 'GSM565275' 'GSM565290' 'GSM565291' 'GSM565292'
 'GSM1054773' 'GSM1054779' 'GSM1054781' 'GSM1054789' 'GSM548000'
 'GSM548001' 'GSM607510' 'GSM607511' 'GSM607512']
---

Initialize project (GSE22886):
Root: /home/jupyteruser/work
Raw data: /home/jupyteruser/work/data/GSE22886_RAW
Sample exist: /home/jupyteruser/work/data/GSE22886_RAW/GSM565273.tar
Sample exist: /home/jupyteruser/work/data/GSE22886_RAW/GSM565274.tar
Sample exist: /home/jupyteruser/work/data/GSE22886_RAW/GSM565275.tar
Sample exist: /home/jupyteruser/work/data/GSE22886_RAW/GSM565290.tar
Sample exist: /home/jupyteruser/work/data/GSE22886_RAW/GSM565291.tar
Sample exist: /home/jupyteruser/work/data/GSE22886_RAW/GSM565292.t

In [6]:
# step 1.2 Analyze Bulk-RNA-seq 

# Bulk RNA-seq analysis accepts many more parameters;
# bulk_data_file, bulk_config_file, gene_format, and species_dataset are required.

exp_prop_rep = 0.5  # proportion of replicates for a gene to be active in a sample
exp_prop_samp = 0.5 # proportion of samples with expression required for a gene
top_prop_rep = 0.9
top_prop_samp = 0.9 
technique = "quantile" 
quantile = 25 

cmd = ' '.join(['python3', 'bulk_gen.py', 
      '-i', '"{}"'.format(bulk_data_file),   # bulk rna-seq data sheet (required)
      '-c', '"{}"'.format(bulk_config_file), # config file for bulk RNA-seq (required)
      '-g', '"{}"'.format(gene_info_file),   # gene info file 
      '-r', '"{}"'.format(exp_prop_rep),     # proportion of replicates for a gene to be active in a sample
      '-s', '"{}"'.format(exp_prop_samp),    # proportion of studies with expression required for gene       
      '-x', '"{}"'.format(top_prop_rep),     # top proportion replicates
      '-y', '"{}"'.format(top_prop_samp),    # top proportion studies
      '-t', '"{}"'.format(technique),        # technique for filtering and normalization    
      '-q', '"{}"'.format(quantile)])        # cutoff TPM quantile for quantile technique
                
!{cmd}

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Data file is "BulkRNAseqDataMatrix_NaiveB.csv"
Supplementary Data file is "bulk_data_inputs_test.csv"
Gene info file is "GeneInfo_NaiveB.csv"
Bulk_Naive.csv
Output File is "/home/jupyteruser/work/data/Bulk_Naive.csv"
[1] "Reading Counts Matrix"
[1] "Filtering Counts"
Test data saved to /home/jupyteruser/work/data/Bulk_Naive.csv

R[write to console]: 1: 
R[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :
R[write to console]: 
 
R[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages

R[write to console]: 2: 
R[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :
R[write to console]: 
 
R[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages

R[write to console]: 3: 
R[write to console]: In (function (
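
The `quantile` parameter above sets a TPM cutoff for the "quantile" technique. One plausible reading, sketched here under stated assumptions, is that a gene counts as expressed in a sample when its TPM lies above that sample's chosen percentile. The function names, the nearest-rank percentile definition, and the strict `>` comparison are all assumptions for illustration; `bulk_gen.py` may compute the cutoff differently.

```python
import math

def quantile_cutoff(values, quantile):
    """Nearest-rank percentile of `values` for a percentile given in the range 0-100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(quantile / 100 * len(ordered)))
    return ordered[rank - 1]

def expressed_genes(tpm_by_gene, quantile=25):
    """Return the genes whose TPM is strictly above the sample's quantile cutoff."""
    cutoff = quantile_cutoff(tpm_by_gene.values(), quantile)
    return sorted(gene for gene, tpm in tpm_by_gene.items() if tpm > cutoff)

# Toy sample with made-up TPM values for eight hypothetical genes.
sample = {'g1': 0.0, 'g2': 1.0, 'g3': 2.0, 'g4': 5.0,
          'g5': 10.0, 'g6': 20.0, 'g7': 50.0, 'g8': 100.0}
print(expressed_genes(sample, quantile=25))
# ['g3', 'g4', 'g5', 'g6', 'g7', 'g8']
```

In this toy sample the 25th percentile is 1.0, so the two lowest genes are filtered out.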

In [7]:
# Step 1.3 Analyze proteomics
percentile = 25

cmd = ' '.join(['python3', 'proteomics_gen.py', 
      '-d', '"{}"'.format(proteomics_data_file), 
      '-s', '"{}"'.format(proteomics_config_file),
      '-e', '"{}"'.format(expression_proportion),
      '-t', '"{}"'.format(top_proportion),
      '-p', '"{}"'.format(percentile)])
!{cmd}

Data file is "ProteomicsDataMatrix.xlsx"
Supplementary Data file is "proteomics_data_inputs.xlsx"
                                Naïve 
0     CopyNumber_T4.naive_01_activated
1     CopyNumber_T4.naive_02_activated
2     CopyNumber_T4.naive_03_activated
3     CopyNumber_T4.naive_04_activated
4  CopyNumber_T4.naive_01_steady-state
5  CopyNumber_T4.naive_02_steady-state
6  CopyNumber_T4.naive_03_steady-state
7  CopyNumber_T4.naive_04_steady-state
Test Data Saved to /home/jupyteruser/work/data/Proteomics_Naive.csv

R[write to console]: 1: 
R[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :
R[write to console]: 
 
R[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages

R[write to console]: 2: 
R[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :
R[write to console]: 
 
R[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages



In [21]:
# Step 1.4 Merge the gene lists of transcriptomics and proteomics, create a list of active gene IDs

expression_requirement = 1 # number of data sources with expression required for a gene
                       # to be considered active if it is not a top gene for any source
                       # (defaults to the total number of input data sources)

cmd = ' '.join(['python3', 'merge_xomics.py', 
      #'-t', '"{}"'.format(transcriptomics_config_file),
      '-b', '"{}"'.format(bulk_config_file),
      '-p', '"{}"'.format(proteomics_config_file),
      '-r', '"{}"'.format(expression_requirement)])
!{cmd}

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Transcriptomics file is "None"
Proteomics file is "proteomics_data_inputs.xlsx"
Bulk RNA-seq file is "bulk_data_inputs_test.csv"
FILENAME
None
                                Naïve 
0     CopyNumber_T4.naive_01_activated
1     CopyNumber_T4.naive_02_activated
2     CopyNumber_T4.naive_03_activated
3     CopyNumber_T4.naive_04_activated
4  CopyNumber_T4.naive_01_steady-state
5  CopyNumber_T4.naive_02_steady-state
6  CopyNumber_T4.naive_03_steady-state
7  CopyNumber_T4.naive_04_steady-state
proteomics exists
Naïve 
Test Data Load From /home/jupyteruser/work/data/Proteomics_Naive.csv
{'Naive':                 CopyNumber_T4.naive_01_activated  ...  top
ENTREZ_GENE_ID                                    ...     
1                                          42776  ...    0
2                                         341085  ...    1
144568         
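
The merge rule described in the cell's comments can be sketched as follows: a gene is kept if it is a top (high-confidence) gene in any source, or if it is expressed in at least `expression_requirement` sources. This is a simplified illustration, not the code of `merge_xomics.py`; the function name `merge_sources` and the `'expressed'` key are hypothetical, while `'top'` mirrors the column visible in the output above.

```python
def merge_sources(sources, expression_requirement=None):
    """Merge calls given as {source_name: {gene_id: {'expressed': 0/1, 'top': 0/1}}}."""
    if expression_requirement is None:
        # Default described in the comment: total number of input data sources.
        expression_requirement = len(sources)
    genes = set().union(*(calls.keys() for calls in sources.values()))
    active = set()
    for gene in genes:
        calls = [src[gene] for src in sources.values() if gene in src]
        n_expressed = sum(call['expressed'] for call in calls)
        is_top = any(call['top'] for call in calls)
        if is_top or n_expressed >= expression_requirement:
            active.add(gene)
    return sorted(active)

# Made-up calls for three hypothetical gene IDs in two sources.
sources = {
    'bulk': {1: {'expressed': 1, 'top': 0},
             2: {'expressed': 0, 'top': 0},
             3: {'expressed': 0, 'top': 0}},
    'proteomics': {1: {'expressed': 0, 'top': 0},
                   2: {'expressed': 0, 'top': 1},
                   3: {'expressed': 0, 'top': 0}},
}
print(merge_sources(sources, expression_requirement=1))  # [1, 2]
print(merge_sources(sources))                            # [2]
```

Lowering `expression_requirement` makes the merged list more permissive; top genes are kept regardless of the setting.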

## Step 2: Create tissue-specific or cell-type-specific Models

In [24]:
# Load the output of step 1, which is a dictionary that specifies the merged list of active Gene IDs for each tissue

step1_results_file = os.path.join(configs.rootdir, 'data', 'step1_results_files.json')
with open(step1_results_file) as json_file:
    tissue_gene_exp = json.load(json_file)
print(tissue_gene_exp)

{'Naive': '/home/jupyteruser/work/data/GeneExpression_Naive_Merged.csv'}


*** Specify input files for step 2 here ***

In [25]:
# (input) filename of General Model, Recon3D_Teff_ver2
GeneralModelFile = 'GeneralModel.mat'

# (input) filename of Tissue Gene Expression
# genefile = 'merged_Th1.csv'

# (output) filename of Tissue Specific Model
# tissuefile = 'Th1_SpecificModel.mat'

In [26]:
# Create tissue-specific models; the names of the output files are stored in the dictionary tissue_spec_model
tissue_spec_model = {}
reconAlgorithm = "GIMME" # troppo reconstruction algorithm to use

for key,value in tissue_gene_exp.items():
    tissuefile = '{}_SpecificModel.mat'.format(key)
    tissue_spec_model[key] = tissuefile
    tissue_gene_file = re.split('/|\\\\', value)[-1]
    tissue_gene_folder = os.path.join(configs.rootdir, 'data', key)
    os.makedirs(tissue_gene_folder, exist_ok=True)
    cmd = ' '.join(['python3', 'create_tissue_specific_model.py', 
                      '-m', '"{}"'.format(GeneralModelFile), 
                      '-g', '"{}"'.format(tissue_gene_file),
                      '-o', '"{}"'.format(tissuefile),
                      '-a', '"{}"'.format(reconAlgorithm)])
    !{cmd}

print(tissue_spec_model)

General Model file is "GeneralModel.mat"
Gene Expression file is "GeneExpression_Naive_Merged.csv"
Output file is "Naive_SpecificModel.mat"
Using "GIMME" reconstruction algorithm
Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  1.000e+00  ratio =  1.000e+00
Problem data seem to be well scaled
(25713, 2)
(0, 2)
(0, 2)
(25713, 2)
Map gene expression to reactions, 0 errors.
OrderedDict([('AGTim', 0), ('AGTix', 0), ('ALAR', 1), ('ARGSL', 1), ('ARGSS', 0), ('ASNNm', 0), ('ASNS1', 1), ('ASPNATm', -1.0), ('ASPTA', 1), ('ASPTAm', 1), ('DASPO1p', 0), ('NACASPAH', 0), ('r0127', 0), ('COKECBESr', 0), ('ACGALK', -1.0), ('ACGALK2', -1.0), ('ACGAM2E', 1), ('ACGAM6PSi', 1), ('ACGAMK', 1), ('ACGAMPM', 1), ('ACNAM9PL', 1), ('ACNAM9PL2', 1), ('ACNAMPH', 1), ('ACNMLr', -1.0), ('AGDC', 1), ('AMANK', 1), ('CHTNASE', 1), ('CHTNASEe', 0), ('CMPSAS', 1), ('CMPSASn', 1), ('G6PDA', 1), ('GF6PTA', 1), ('HEX10', 1), ('HMR_4124', 1), ('KDNH', -1.0), ('r0013', 0), ('r0113', 1), ('r0363', 1), ('r0364', 1), ('r0782'

## Step 3: Identifying disease related genes by analyzing transcriptomics data of patients
Differential expression analysis.

Only one disease is analyzed; output files are written to the data folder.

*** Specify input files for step 3 here ***

In [23]:
# input filename for transcriptomics data of the disease
disease_gene_file = 'disease_transcriptomics_data_inputs.xlsx'

In [26]:
# load the results of step 3 into the dictionary 'disease_files'
# (run this cell after the differential gene expression analysis cell below has finished)
step3_results_file = os.path.join(configs.datadir, 'step2_results_files.json')
with open(step3_results_file) as json_file:
    disease_files = json.load(json_file)
print(disease_files)

{'GSE': 'GSE56649', 'UP_Reg': '/home/jupyteruser/work/data/Disease_UP_GSE56649.txt', 'DN_Reg': '/home/jupyteruser/work/data/Disease_DOWN_GSE56649.txt', 'RAW_Data': '/home/jupyteruser/work/data/Raw_Fit_GSE56649.csv'}
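
A quick sanity check on the step 3 results is to count the identifiers in the up- and down-regulated lists. The sketch below assumes each text file holds one gene identifier per line, which is not confirmed by this notebook; the helper names are hypothetical, and the demo uses temporary files in place of the real outputs.

```python
import os
import tempfile

def read_gene_ids(path):
    """Read a text file with one gene identifier per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

def summarize_disease_files(disease_files):
    """Count the identifiers in the lists stored under 'UP_Reg' and 'DN_Reg'."""
    return {key: len(read_gene_ids(disease_files[key])) for key in ('UP_Reg', 'DN_Reg')}

# Demo with temporary files standing in for the pipeline outputs.
with tempfile.TemporaryDirectory() as tmp:
    up_path = os.path.join(tmp, 'Disease_UP_demo.txt')
    down_path = os.path.join(tmp, 'Disease_DOWN_demo.txt')
    with open(up_path, 'w') as handle:
        handle.write('101\n202\n\n303\n')
    with open(down_path, 'w') as handle:
        handle.write('404\n')
    demo_counts = summarize_disease_files({'UP_Reg': up_path, 'DN_Reg': down_path})
print(demo_counts)  # {'UP_Reg': 3, 'DN_Reg': 1}
```

In the notebook itself, `summarize_disease_files(disease_files)` would be called on the dictionary loaded above, provided the file format matches the assumption.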


In [37]:
# Differential gene expression analysis
cmd = ' '.join(['python3', 'disease_analysis.py', 
              '-i', '"{}"'.format(disease_gene_file)])
!{cmd}

Input file is " disease_transcriptomics_data_inputs.xlsx
Initialize project (GSE56649):
Root: /home/jupyteruser/work
Raw data: /home/jupyteruser/work/data/GSE56649_RAW
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366348.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366349.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366350.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366351.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366352.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366353.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366354.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366355.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366356.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366357.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366358.tar
Sample exist: /home/jupyteruser/work/data/GSE56649_RAW/GSM1366

## Step 4: Identification of drug targets and repurposable drugs
This step maps drug targets onto metabolic models, performs knock-out simulations, compares the simulation results with disease genes, and identifies drug targets and repurposable drugs.

*** Specify input files for step 4 here ***

1. Instruction: A processed drug-target file is included in `/root/pipelines/data/`. (Optional step) To use an updated version, download `Repurposing_Hub_export.txt` from the [Drug Repurposing Hub](https://clue.io/repurposing-app). Remove all activators, agonists, and withdrawn drugs from the downloaded file, then upload it to `/root/pipelines/data/`.

2. To use the automatically created tissue-specific models, run the following cells. Note: It is recommended to use refined and validated models for further analysis. Users can define customized models in the next sub-step.

In [None]:
# tissue specific models
tissue_spec_model

In [None]:
Disease_Down = disease_files['DN_Reg']
Disease_Up = disease_files['UP_Reg']
drug_raw_file = 'Repurposing_Hub_export.txt'
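
The optional cleanup of a freshly downloaded export (removing activators, agonists, and withdrawn drugs) could be scripted rather than done by hand. The sketch below is a hedged illustration: it assumes a tab-delimited file with a mechanism-of-action column named `MOA` and a clinical phase column named `Phase`, and both column names are assumptions that should be checked against the actual export. Matching is done on whole words so that "antagonist" is not mistaken for "agonist".

```python
import csv
import io
import re

EXCLUDED_MOA_WORDS = {'activator', 'agonist'}

def keep_drug(row, moa_col='MOA', phase_col='Phase'):
    """Return True unless the row is an activator, an agonist, or a withdrawn drug."""
    # Compare whole words so that 'antagonist' does not match 'agonist'.
    moa_words = set(re.findall(r'[a-z]+', (row.get(moa_col) or '').lower()))
    withdrawn = (row.get(phase_col) or '').strip().lower() == 'withdrawn'
    return not withdrawn and not (moa_words & EXCLUDED_MOA_WORDS)

def filter_drugs(handle, **cols):
    """Read tab-delimited rows from `handle` and keep those passing `keep_drug`."""
    reader = csv.DictReader(handle, delimiter='\t')
    return [row for row in reader if keep_drug(row, **cols)]

# Made-up rows standing in for the downloaded export.
demo = io.StringIO(
    'Name\tMOA\tPhase\n'
    'drugA\tenzyme inhibitor\tLaunched\n'
    'drugB\treceptor agonist\tLaunched\n'
    'drugC\treceptor antagonist\tLaunched\n'
    'drugD\tenzyme inhibitor\tWithdrawn\n'
)
print([row['Name'] for row in filter_drugs(demo)])  # ['drugA', 'drugC']
```

Note that this whole-word rule also drops entries such as "inverse agonist"; whether that is desired depends on the analysis and should be decided by the user.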

3. To use a customized model, specify `tissue_spec_model` manually, e.g. by uncommenting `tissue_spec_model` in the following cell.

In [None]:
# Manually specify up- and down-regulated genes for the disease. (Upload manually created files to `/pipelines/data/`. Use the filenames given below or change them accordingly.)
# Disease_Down = 'Disease_DOWN.txt'
# Disease_Up = 'Disease_UP.txt'
# drug_raw_file = 'Repurposing_Hub_export.txt'

# Manually specify tissue-specific models fine-tuned by the user. Change the file names accordingly. A single model or multiple models can be used here; using multiple models increases simulation time.
# tissue_spec_model = {'Th1':'Th1Model.mat',
#                      'Th2':'Th2Model.mat',
#                      'Th17':'Th17Model.mat',
#                      'Naive':'NaiveModel.mat'}

# Manually specify tissue-specific models created by the MATLAB cobratoolbox. For the example run, we have provided four models of CD4+ T cells (naive, Th1, Th2, and Th17); uncomment all of them or any specific model.
# tissue_spec_model = {'Th1':'Th1_SpecificModel_matlab.mat',
#                      'Th2':'Th2_SpecificModel_matlab.mat',
#                      'Th17':'Th17_SpecificModel_matlab.mat',
#                      'Naive':'Naive_SpecificModel_matlab.mat'}


In [None]:
# Knock out simulation for the analyzed tissues
for key,value in tissue_spec_model.items():
    tissueSpecificModelfile = value
    tissue_gene_folder = os.path.join(configs.datadir, key)
    os.makedirs(tissue_gene_folder, exist_ok=True)
    inhibitors_file = '{}_inhibitors_Entrez.txt'.format(key)
    cmd = ' '.join(['python3' , 'knock_out_simulation.py',
                  '-t', tissueSpecificModelfile,
                  '-i', inhibitors_file,
                  '-u', Disease_Up,
                  '-d', Disease_Down,
                  '-f', key,
                  '-r', drug_raw_file])
    !{cmd}
    
    # copy generated output to output folder
    cmd = ' '.join(['cp', '-a', os.path.join(configs.datadir, key), configs.outputdir])
    !{cmd}
    #break
