## Instructions

This Jupyter notebook runs the MADRID pipeline to identify drug targets and repurposable drugs for user-defined complex human diseases. The process consists of four steps:
1. Download and analyze transcriptomics and proteomics data, and output a list of active genes.
2. Create tissue-specific models based on the list of active genes. If required, the user can manually refine these models and supply them in Step 4.
3. Identify differentially expressed genes from disease datasets.
4. Identify drug targets and repurposable drugs. This step consists of four substeps:
 (i) map drugs onto automatically created or user-supplied models, (ii) run knock-out simulations, (iii) compare simulation results of perturbed and unperturbed models, and (iv) integrate with disease genes and score drug targets.

The user needs to create the input files for each step, upload them to `/root/pipelines/data/` in the Docker container, and specify them in this notebook. The original Docker image includes example input files for building metabolic models of naive, Th1, Th2, and Th17 subtypes and identifying drug targets for rheumatoid arthritis. Follow the documentation and the format of the example input files to create your own input files.
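Before running a step, it can help to confirm that every input file it names is actually present in the data directory. The following is a minimal sketch, not part of the pipeline: `find_missing_inputs` is a hypothetical helper, and the directory and file names are simply the ones mentioned in this notebook.

```python
import os

def find_missing_inputs(data_dir, filenames):
    """Return the names in `filenames` that are not present in `data_dir`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(data_dir, name))]

# Hypothetical usage: inside the notebook, data_dir would typically be
# os.path.join(configs.rootdir, 'data').
missing = find_missing_inputs("/root/pipelines/data",
                              ["BulkRNAseqDataMatrix.csv", "bulk_data_inputs.csv"])
print("missing input files:", missing)
```

An empty list means all named files were found; anything else should be uploaded before the corresponding cell is run.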

In [1]:
# import necessary python packages
import sys
import os
import pandas
import numpy
import json
import re
from subprocess import call
from project import configs


# print root path of the project
print(configs.rootdir) 

G:/GitHub/MADRID/docker/pipelines/


## Step 1: Identifying gene activity by analyzing transcriptomics and proteomics datasets

*** Specify input files for step 1 here ***

If proteomics data is not available, use:

proteomics_data_file = 'dummy_proteomics_data.xlsx'

proteomics_config_file = 'dummy_proteomics_config.xlsx'

In [9]:
# Step 0: Generate count matrix from gene counts files generated from STAR

technique = "quantile" # technique for bulk RNA-seq active gene determination
                        # for count matrix gen, only used to determine whether or not
                        # picard output mean fragment sizes are required.
        
input_dir = "docker/pipelines/data/bulkData/NaiveB/"
output_dir = "docker/pipelines/data/"
        
cmd = ' '.join(['python3', 'generateCountMatrix.py',
                '-i', '"{}"'.format(input_dir),
                '-o', '"{}"'.format(output_dir),
                '-t', '"{}"'.format(technique)])
!{cmd}
# Alternatively, the gene count matrix for RNA-seq can be crafted any other way desired
# and this step can be skipped

['generateCountMatrix.py', '-i', 'G:/GitHub/MADRID/docker/pipelines/data/bulkData/NaiveB/', '-o', 'G:/GitHub/MADRID/docker/pipelines/data/', '-t', 'quantile']
Input directory is "G:/GitHub/MADRID/docker/pipelines/data/bulkData/NaiveB/"
Output file is "G:/GitHub/MADRID/docker/pipelines/data/"
Active gene determination technique is "quantile"
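As the comment in the cell notes, the count matrix can also be built another way. The sketch below shows one such way, assuming the standard STAR `ReadsPerGene.out.tab` layout (tab-separated, gene ID in the first column, unstranded counts in the second, summary rows prefixed with `N_`). It is not the logic of `generateCountMatrix.py`, and the `genes` header and function names are hypothetical.

```python
import csv

def read_star_counts(path, column=1):
    """Read one STAR gene counts file into {gene_id: count}.

    column=1 selects the unstranded counts; rows whose first field starts
    with 'N_' (N_unmapped, N_multimapping, ...) are summary rows and skipped.
    """
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("N_"):
                continue
            counts[row[0]] = int(row[column])
    return counts

def build_count_matrix(sample_files):
    """sample_files: {sample_name: path}. Returns (header, rows), one row per gene.

    Genes missing from a sample are filled with 0.
    """
    per_sample = {name: read_star_counts(path) for name, path in sample_files.items()}
    genes = sorted(set().union(*per_sample.values()))
    header = ["genes"] + list(per_sample)
    rows = [[gene] + [per_sample[name].get(gene, 0) for name in per_sample]
            for gene in genes]
    return header, rows
```

The returned header and rows can be written out with `csv.writer`; the exact column layout expected by `bulk_gen.py` should be taken from the example input files.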


In [2]:
# wd for development

# Specific input files for step 1

# config file for transcriptomics (microarray)
transcriptomics_config_file = 'transcriptomics_data_inputs.xlsx'

# data file for bulk rna-seq
bulk_data_file = 'BulkRNAseqDataMatrix.csv'

# config for bulk rna-seq
bulk_config_file = 'bulk_data_inputs.csv'

# data file for proteomics
proteomics_data_file = 'ProteomicsDataMatrix.xlsx' 

# config file for proteomics
proteomics_config_file = 'proteomics_data_inputs.xlsx'

# proportion of replicates required for a gene to be considered active in that sample
expression_proportion = 0.5

# if a gene is in the top nth percentile in any sample, it is considered high confidence
# and treated as expressed regardless of the results of other methods
top_percentile = 10

In [4]:
# Step 1.1 Download and analyze transcriptomics
cmd = ' '.join(['python3', 'transcriptomic_gen.py', 
      '-i', '"{}"'.format(transcriptomics_config_file)])
!{cmd}

'python3' is not recognized as an internal or external command,
operable program or batch file.
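The error above shows that `python3` is not on the `PATH` of the Windows machine where this cell was run. One way around it, sketched here under the assumption that the scripts should run with the same interpreter as the notebook, is to build the command from `sys.executable` and pass it to `subprocess.run` as an argument list. `build_command` and `run_script` are hypothetical helpers, not part of the pipeline.

```python
import subprocess
import sys

def build_command(script, *args):
    """Return an argument list that runs `script` with the current interpreter."""
    return [sys.executable, script, *map(str, args)]

def run_script(script, *args, cwd=None):
    """Run a pipeline script and return its exit code.

    Passing a list (no shell) means arguments need no manual quoting.
    """
    return subprocess.run(build_command(script, *args), cwd=cwd).returncode

# Hypothetical usage, mirroring the cell above:
# run_script('transcriptomic_gen.py', '-i', transcriptomics_config_file)
```

Because the arguments are passed as a list, the `'"{}"'.format(...)` quoting used in the notebook cells is not needed with this approach.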


In [15]:
# step 1.2 Analyze Bulk-RNA-seq 

# bulk_gen.py takes several parameters;
# bulk_data_file, bulk_config_file, gene_format, and species_dataset are required.

gene_format = "ensembl" # gene format in count file for biomart
species_dataset = "human" # species dataset for biomart
exp_prop_rep = 0.5  # proportion of replicates for a gene to be active in a sample
exp_prop_samp = 0.5 # proportion of samples with expression required for gene
# any replicate with expression in this percentile is expressed, regardless of other sources
# (this overwrites the value of 10 set with the Step 1 input files, which Step 1.3 also reads)
top_percentile = 5
technique = "quantile" # quantile, cpm, or zFPKM
quantile = 25 # only used with quantile

cmd = ' '.join(['python3', 'bulk_gen.py', 
      '-f', '"{}"'.format(bulk_data_file),   # bulk rna-seq data sheet (required)
      '-c', '"{}"'.format(bulk_config_file), # config file for bulk RNA-seq (required)
      '-g', '"{}"'.format(gene_format),      # gene format in count file for biomart (required)
      '-d', '"{}"'.format(species_dataset),  # species dataset for biomart (required)
      '-r', '"{}"'.format(exp_prop_rep),     # proportion of replicates for a gene to be active in a sample
      '-s', '"{}"'.format(exp_prop_samp),    # proportion of samples with expression required for gene       
      '-p', '"{}"'.format(top_percentile),   # top percentile 
      '-t', '"{}"'.format(technique),        # technique for filtering and normalization    
      '-q', '"{}"'.format(quantile)])        # cutoff TPM quantile for the quantile technique
                
!{cmd}

Data file is "BulkRNAseqDataMatrix.csv"
Supplementary Data file is "bulk_data_inputs.csv"
G:/GitHub/MADRID/docker/pipelines/data\Bulk_CD8T.csv
Test data saved to G:/GitHub/MADRID/docker/pipelines/data\Bulk_CD8T.csv


  from pandas.core.index import Index as PandasIndex
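To make the roles of `quantile`, `exp_prop_rep`, and `top_percentile` concrete, here is a small pure-Python sketch of how such thresholds could be combined for one sample. It is an illustration under stated assumptions, not the implementation in `bulk_gen.py`: the nearest-rank percentile, the `>` and `>=` comparisons, and the function names are all choices made for this example. The `expressed` and `top` flags mirror the column names that appear in the Step 1.4 output.

```python
import math

def percentile_cutoff(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def call_active_genes(expr, quantile=25, exp_prop_rep=0.5, top_percentile=5):
    """expr: {gene: [expression per replicate]}. Returns {gene: (expressed, top)}.

    expressed: the gene is above the `quantile` cutoff in at least
               `exp_prop_rep` of the replicates.
    top:       the gene is at or above the (100 - top_percentile) cutoff
               in any replicate.
    """
    n_reps = len(next(iter(expr.values())))
    low, high = [], []
    for r in range(n_reps):
        column = [values[r] for values in expr.values()]
        low.append(percentile_cutoff(column, quantile))
        high.append(percentile_cutoff(column, 100 - top_percentile))
    calls = {}
    for gene, values in expr.items():
        n_above = sum(v > low[r] for r, v in enumerate(values))
        expressed = int(n_above / n_reps >= exp_prop_rep)
        top = int(any(v >= high[r] for r, v in enumerate(values)))
        calls[gene] = (expressed, top)
    return calls

# Toy data (made up for illustration): four genes, two replicates.
toy = {"g1": [0, 0], "g2": [1, 0], "g3": [5, 6], "g4": [100, 90]}
print(call_active_genes(toy))
```

Raising `exp_prop_rep` makes the call stricter: a gene detected in only one of two replicates passes at 0.5 but not at 1.0.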


In [6]:
# Step 1.3 Analyze proteomics
cmd = ' '.join(['python3', 'proteomics_gen.py', 
      '-d', '"{}"'.format(proteomics_data_file), 
      '-s', '"{}"'.format(proteomics_config_file),
      '-e', '"{}"'.format(expression_proportion),
      '-p', '"{}"'.format(top_percentile)])
!{cmd}

Data file is "ProteomicsDataMatrix.xlsx"
Supplementary Data file is "proteomics_data_inputs.xlsx"
                                Naïve 
0     CopyNumber_T4.naive_01_activated
1     CopyNumber_T4.naive_02_activated
2     CopyNumber_T4.naive_03_activated
3     CopyNumber_T4.naive_04_activated
4  CopyNumber_T4.naive_01_steady-state
5  CopyNumber_T4.naive_02_steady-state
6  CopyNumber_T4.naive_03_steady-state
7  CopyNumber_T4.naive_04_steady-state
Test Data Saved to G:/GitHub/MADRID/docker/pipelines/data\Proteomics_Naive.csv


  from pandas.core.index import Index as PandasIndex


In [16]:
# Step 1.4 Merge the gene lists of transcriptomics and proteomics, create a list of active gene IDs

cmd = ' '.join(['python3', 'merge_xomics.py', 
      #'-t', '"{}"'.format(transcriptomics_config_file),
      '-b', '"{}"'.format(bulk_config_file)])
      #'-p', '"{}"'.format(proteomics_config_file),])
!{cmd}

Transcriptomics file is "None"
Proteomics file is "None"
Bulk RNA-seq file is "bulk_data_inputs.csv"
{'dummy': 'dummy_data'}
  SampleName InsertSize
0   FILENAME       CD8T
1   GROUP_S1   GROUP_S1
2  CD8T_rep1          0
3  CD8T_rep2          0
4  CD8T_rep3          0
5  CD8T_rep4          0
6  CD8T_rep5          0
7  CD8T_rep6          0
8  CD8T_rep7          0
9  CD8T_rep8          0
  SampleName InsertSize
0   FILENAME       CD8T
0    CD8T
Name: InsertSize, dtype: object
bulk exists
CD8T
expressed    int64
top          int64
dtype: object
Test Data Load From G:/GitHub/MADRID/docker/pipelines/data\Bulk_CD8T.csv
{'CD8T':                 expressed  top
ENTREZ_GENE_ID                
7105                    0    0
64102                   0    0
8813                    1    0
57147                   0    0
55732                   0    0
...                   ...  ...
146512                  0    0
101929608               0    0
100129503               0    0
84931                   0    

  from pandas.core.index import Index as PandasIndex
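The merge step combines per-source `expressed` and `top` flags into one list of active gene IDs. The rule used by `merge_xomics.py` is not shown in this notebook, so the sketch below only illustrates one plausible rule, consistent with the earlier comment that top-percentile genes count as expressed regardless of other methods: a gene is active if it is `top` in any source or `expressed` in at least `min_sources` sources. The function name, the `min_sources` parameter, and the toy gene IDs and flags are all hypothetical.

```python
def merge_sources(sources, min_sources=1):
    """sources: {source_name: {gene_id: (expressed, top)}}.

    Returns a sorted list of gene IDs that are `top` in any source or
    `expressed` in at least `min_sources` sources (assumed rule).
    """
    genes = set().union(*sources.values())
    active = []
    for gene in sorted(genes):
        flags = [calls[gene] for calls in sources.values() if gene in calls]
        n_expressed = sum(expressed for expressed, _ in flags)
        is_top = any(top for _, top in flags)
        if is_top or n_expressed >= min_sources:
            active.append(gene)
    return active

# Toy flags, made up for illustration only.
toy_sources = {
    "bulk":       {101: (0, 0), 102: (1, 0), 103: (0, 1)},
    "proteomics": {101: (1, 0), 102: (1, 0)},
}
print(merge_sources(toy_sources, min_sources=2))
```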


## Step 2: Create tissue-specific or cell-type-specific models

In [17]:
# Load the output of step 1, which is a dictionary that specifies the merged list of active Gene IDs for each tissue

step1_results_file = os.path.join(configs.rootdir, 'data', 'step1_results_files.json')
with open(step1_results_file) as json_file:
    tissue_gene_exp = json.load(json_file)
print(tissue_gene_exp)

{'CD8T': 'G:/GitHub/MADRID/docker/pipelines/data\\GeneExpression_CD8T_Merged.csv'}


*** Specify input files for step 2 here ***

In [18]:
# (input) filename of General Model, Recon3D_Teff_ver2
GeneralModelFile = 'GeneralModel.mat'

# (input) filename of Tissue Gene Expression
# genefile = 'merged_Th1.csv'

# (output) filename of Tissue Specific Model
# tissuefile = 'Th1_SpecificModel.mat'

In [19]:
# create tissue specific model, the names of output files are stored in dictionary tissue_spec_model
tissue_spec_model = {}
reconAlgorithm = "GIMME" # troppo reconstruction algorithm to use

for key,value in tissue_gene_exp.items():
    tissuefile = '{}_SpecificModel.mat'.format(key)
    tissue_spec_model[key] = tissuefile
    tissue_gene_file = re.split('/|\\\\', value)[-1]
    tissue_gene_folder = os.path.join(configs.rootdir, 'data', key)
    os.makedirs(tissue_gene_folder, exist_ok=True)
    cmd = ' '.join(['python3', 'create_tissue_specific_model.py', 
                      '-m', '"{}"'.format(GeneralModelFile), 
                      '-g', '"{}"'.format(tissue_gene_file),
                      '-o', '"{}"'.format(tissuefile),
                      '-a', '"{}"'.format(reconAlgorithm)])
    !{cmd}

print(tissue_spec_model)

General Model file is "GeneralModel.mat"
Gene Expression file is "GeneExpression_CD8T_Merged.csv"
Output file is "CD8T_SpecificModel.mat"
{'CD8T': 'CD8T_SpecificModel.mat'}


  'Will not normalize rules with more than ' + str(token_to_gene_ratio) + ' average tokens per gene')


Using "GIMME" reconstruction algorithm
(24860, 2)
(0, 2)
(0, 2)
(24860, 2)
Map gene expression to reactions, 0 errors.
OrderedDict([('AGTim', 0), ('AGTix', 0), ('ALAR', 0), ('ARGSL', 0), ('ARGSS', 0), ('ASNNm', 0), ('ASNS1', 1), ('ASPNATm', -1.0), ('ASPTA', 0), ('ASPTAm', 1), ('DASPO1p', 0), ('NACASPAH', 0), ('r0127', 0), ('COKECBESr', 0), ('ACGALK', -1.0), ('ACGALK2', -1.0), ('ACGAM2E', 0), ('ACGAM6PSi', 0), ('ACGAMK', 0), ('ACGAMPM', 0), ('ACNAM9PL', 1), ('ACNAM9PL2', 1), ('ACNAMPH', 0), ('ACNMLr', -1.0), ('AGDC', 0), ('AMANK', 0), ('CHTNASE', 0), ('CHTNASEe', 0), ('CMPSAS', 0), ('CMPSASn', 0), ('G6PDA', 0), ('GF6PTA', 1), ('HEX10', 1), ('HMR_4124', 1), ('KDNH', -1.0), ('r0013', 1), ('r0113', 0), ('r0363', 1), ('r0364', 1), ('r0782', 0), ('r1374', 0), ('r1375', -1.0), ('UAG2EMAi', 0), ('UAG4E', 0), ('UAGALDP', -1.0), ('UAGDP', 1), ('HMR_1944', 0), ('HMR_1958', 0), ('HMR_1962', 0), ('HMR_1967', 0), ('HMR_1968', 0), ('HMR_1970', 0), ('HMR_1971', 0), ('HMR_1976', 0), ('HMR_1982', 0), ('
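The loop above extracts the file name from a path that mixes `/` and `\` separators by splitting on a regular expression. An alternative sketch uses `pathlib.PureWindowsPath` from the standard library, which treats both characters as separators on any operating system. `file_name` is a hypothetical helper, and the path is the one printed from `step1_results_files.json` above.

```python
from pathlib import PureWindowsPath

def file_name(path):
    """Return the last component of a path that may mix '/' and backslash separators."""
    return PureWindowsPath(path).name

print(file_name('G:/GitHub/MADRID/docker/pipelines/data\\GeneExpression_CD8T_Merged.csv'))
```

One caveat: on Linux a backslash is a legal character in a file name, so this sketch (like the regular expression it replaces) assumes file names never contain one.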

## Step 3: Identifying disease-related genes by analyzing transcriptomics data of patients
Differential expression analysis.

Only one disease is analyzed at a time; output files are written to the data folder.
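As a rough illustration of what "differential expression" produces, the sketch below splits genes into up- and down-regulated lists by log2 fold change between disease and control samples. This is deliberately simplified and is not the method used by `disease_analysis.py`, which calls R through `rpy2`; a real analysis would also apply a statistical test with multiple-testing correction. The function name, threshold, and pseudocount are assumptions for this example.

```python
import math
from statistics import mean

def split_by_fold_change(disease, control, threshold=1.0, pseudocount=1.0):
    """disease/control: {gene: [expression values]}.

    Returns (up, down): genes whose log2 fold change (disease vs. control)
    is >= threshold or <= -threshold. The pseudocount avoids division by zero.
    """
    up, down = [], []
    for gene in sorted(set(disease) & set(control)):
        ratio = (mean(disease[gene]) + pseudocount) / (mean(control[gene]) + pseudocount)
        log2fc = math.log2(ratio)
        if log2fc >= threshold:
            up.append(gene)
        elif log2fc <= -threshold:
            down.append(gene)
    return up, down
```

The two lists play the same role as the `UP_Reg` and `DN_Reg` entries that Step 4 reads from the Step 3 results.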

*** Specify input files for step 3 here ***

In [2]:
# input file name for the disease transcriptomics data
disease_gene_file = 'disease_transcriptomics_data_inputs.xlsx'

In [4]:
# Differential gene expression analysis
cmd = ' '.join(['python3', 'disease_analysis.py', 
              '-i', '"{}"'.format(disease_gene_file)])
!{cmd}

Input file is " disease_transcriptomics_data_inputs.xlsx
Initialize project (GSE56649):
Root: G:/GitHub/MADRID/docker/pipelines/
Raw data: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW
Retrieve Sample: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GSM1366348.tar
Extract to: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GPL570
Retrieve Sample: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GSM1366349.tar
Extract to: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GPL570
Retrieve Sample: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GSM1366350.tar
Extract to: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GPL570
Retrieve Sample: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GSM1366351.tar
Extract to: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GPL570
Retrieve Sample: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GSM1366352.tar
Extract to: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GPL570
Retrieve Sample: G:/GitHub/MADRID/

  from pandas.core.index import Index as PandasIndex
Traceback (most recent call last):
  File "disease_analysis.py", line 123, in <module>
    main(sys.argv[1:])
  File "disease_analysis.py", line 83, in main
    data2 = affyio.fitaffydir(rawdir, targetdir)
  File "C:\Users\babes\Anaconda3\lib\site-packages\rpy2\robjects\functions.py", line 178, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
  File "C:\Users\babes\Anaconda3\lib\site-packages\rpy2\robjects\functions.py", line 106, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
rpy2.rinterface.RRuntimeError: Error in file(file, "rt") : cannot open the connection




Retrieve Sample: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GSM1366369.tar
Extract to: G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GPL570
Retrieve Samples Completed.
GPL570:affy, G:/GitHub/MADRID/docker/pipelines/data\GSE56649_RAW\GPL570
Error in file(file, "rt") : cannot open the connection
In file(file, "rt") :
  cannot open file 'data\targets.txt': No such file or directory


In [None]:
# load the results of step 3 to dictionary 'disease_files'
# (the file name says "step2"; check that it matches the file written by disease_analysis.py)
step3_results_file = os.path.join(configs.datadir, 'step2_results_files.json')
with open(step3_results_file) as json_file:
    disease_files = json.load(json_file)
print(disease_files)

## Step 4: Identification of drug targets and repurposable drugs
This step maps drug targets onto the metabolic models, performs knock-out simulations, compares the simulation results with disease genes, and identifies drug targets and repurposable drugs.

*** Specify input files for step 4 here ***

1. Instruction: A processed drug-target file is included in `/root/pipelines/data/`. (Optional) To use an updated version, download `Repurposing_Hub_export.txt` from the [Drug Repurposing Hub](https://clue.io/repurposing-app). Remove all activators, agonists, and withdrawn drugs from the downloaded file, then upload it to `/root/pipelines/data/`.

2. To use the automatically created tissue-specific models, run the following cells. Note: it is recommended to use refined and validated models for further analysis. Users can define customized models in the next substep.

In [None]:
# tissue specific models
tissue_spec_model

In [None]:
Disease_Down = disease_files['DN_Reg']
Disease_Up = disease_files['UP_Reg']
drug_raw_file = 'Repurposing_Hub_export.txt'
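Item 1 above asks for activators, agonists, and withdrawn drugs to be removed from a freshly downloaded export. A minimal sketch of that filter follows. It assumes a tab-separated file with a mechanism-of-action column and a clinical-phase column; the column names `moa` and `clinical_phase` and both function names are hypothetical and should be checked against the header of the actual download. Matching is done on whole words so that "antagonist" is not mistaken for "agonist".

```python
import csv
import re

EXCLUDED_WORDS = {"activator", "activators", "agonist", "agonists"}

def keep_drug(row, moa_col="moa", phase_col="clinical_phase"):
    """Return False for activators, agonists, and withdrawn drugs."""
    words = set(re.findall(r"[a-z]+", (row.get(moa_col) or "").lower()))
    withdrawn = "withdrawn" in (row.get(phase_col) or "").lower()
    return not (words & EXCLUDED_WORDS) and not withdrawn

def filter_drug_file(src, dst, **columns):
    """Copy the rows of `src` that pass keep_drug to `dst`; return how many were kept."""
    kept = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                                delimiter="\t", extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            if keep_drug(row, **columns):
                writer.writerow(row)
                kept += 1
    return kept
```

Under this sketch, a drug with several listed mechanisms is dropped if any one of them is an agonist or activator; whether that is the intended behavior depends on how the processed file shipped with the image was prepared.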

3. To use a customized model, specify `tissue_spec_model` manually, e.g. by uncommenting `tissue_spec_model` in the following cell.

In [None]:
# Manually specify up- and down-regulated genes for the disease. (Upload the manually created files to `/pipelines/data/`. Use the file names given below or change them accordingly.)
# Disease_Down = 'Disease_DOWN.txt'
# Disease_Up = 'Disease_UP.txt'
# drug_raw_file = 'Repurposing_Hub_export.txt'

# Manually specify tissue-specific models fine-tuned by the user. Change the file names accordingly. Single or multiple models can be used here; using multiple models increases simulation time.
# tissue_spec_model = {'Th1':'Th1Model.mat',
#                      'Th2':'Th2Model.mat',
#                      'Th17':'Th17Model.mat',
#                      'Naive':'NaiveModel.mat'}

# Manually specify tissue-specific models created with the MATLAB COBRA Toolbox. For the example run, we provide four models of CD4+ T cells (naive, Th1, Th2, and Th17); uncomment all of them or any specific model.
# tissue_spec_model = {'Th1':'Th1_SpecificModel_matlab.mat',
#                      'Th2':'Th2_SpecificModel_matlab.mat',
#                      'Th17':'Th17_SpecificModel_matlab.mat',
#                      'Naive':'Naive_SpecificModel_matlab.mat'}


In [None]:
# Knock out simulation for the analyzed tissues
for key,value in tissue_spec_model.items():
    tissueSpecificModelfile = value
    tissue_gene_folder = os.path.join(configs.datadir, key)
    os.makedirs(tissue_gene_folder, exist_ok=True)
    inhibitors_file = '{}_inhibitors_Entrez.txt'.format(key)
    cmd = ' '.join(['python3' , 'knock_out_simulation.py',
                  '-t', tissueSpecificModelfile,
                  '-i', inhibitors_file,
                  '-u', Disease_Up,
                  '-d', Disease_Down,
                  '-f', key,
                  '-r', drug_raw_file])
    !{cmd}
    
    # copy generated output to output folder
    cmd = ' '.join(['cp', '-a', os.path.join(configs.datadir, key), configs.outputdir])
    !{cmd}
    #break
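The `cp -a` call in the loop above relies on a Unix shell, while the outputs earlier in this notebook come from a Windows machine. A portable alternative, sketched here with the standard library, is `shutil.copytree` with `dirs_exist_ok=True` (available from Python 3.8). `copy_results` is a hypothetical helper; it always places the copy at `output_dir/<folder name>`, which matches what `cp -a` does when the output directory already exists.

```python
import os
import shutil

def copy_results(src_dir, output_dir):
    """Copy src_dir to output_dir/<basename of src_dir>, merging with any existing copy."""
    target = os.path.join(output_dir, os.path.basename(os.path.normpath(src_dir)))
    shutil.copytree(src_dir, target, dirs_exist_ok=True)
    return target

# Hypothetical usage inside the loop above:
# copy_results(os.path.join(configs.datadir, key), configs.outputdir)
```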
