## Instructions

This Jupyter notebook runs the MADRID pipeline to identify drug targets and repurposable drugs for user-defined complex human diseases. The entire process contains five steps:

0. Preprocess bulk RNA-seq data by converting STAR-output gene counts into a unified matrix and fetching the information about each gene needed for normalization via TPM or FPKM.
1. Download and analyze microarray, bulk RNA-seq, and proteomics data, and output a list of active genes.
2. Create tissue-specific models based on the list of active genes. If required, the user can manually refine these models and supply them in Step 4.
3. Identify differentially expressed genes from disease datasets using either microarray or bulk RNA-seq transcriptomics data.
4. Identify drug targets and repurposable drugs. This step consists of four substeps:
 (i) map drugs onto automatically created or user-supplied models, (ii) run knock-out simulations, (iii) compare simulation results of perturbed and unperturbed models, and (iv) integrate with disease genes and score drug targets.

The user should upload the configuration Excel files to `/work/data/config_sheets` in the Docker container. The sheet names in these config files should correspond to different models, where each sheet contains a list of the samples to include for that model. These sample names should match the sample names in the source data, which is located in `/work/data/data_matrices/<model name>/`.
    
The original Docker image includes example input files for building metabolic models of naive, Th1, Th2, and Th17 subtypes and identifying drug targets for rheumatoid arthritis. Users should follow the documentation and the format of these example files when creating their own input files.
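As a quick sanity check before running the pipeline, the sample names listed in a config sheet can be compared against the column headers of the corresponding data matrix. The sketch below uses only the standard library and assumes the matrix is a CSV file whose first column holds gene identifiers and whose remaining header cells are sample names. The function name `missing_samples` and the example values are illustrative and not part of MADRID.

```python
import csv
import io


def missing_samples(sheet_samples, matrix_csv_text):
    """Return config-sheet sample names that are absent from the matrix header.

    Assumption: the first header cell is the gene identifier column.
    """
    header = next(csv.reader(io.StringIO(matrix_csv_text)))
    matrix_samples = set(header[1:])  # skip the gene identifier column
    return [name for name in sheet_samples if name not in matrix_samples]


# Illustrative data only: a two-sample matrix and a sheet listing three samples.
matrix_text = "genes,liver_control_S1R1,liver_control_S1R2\nENSMUSG00000000001,10,12\n"
sheet = ["liver_control_S1R1", "liver_control_S1R2", "liver_control_S2R1"]
print(missing_samples(sheet, matrix_text))
```

Any name reported here would need to be added to the matrix or removed from the sheet before the model for that sheet is built.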

In [1]:
# import necessary python packages
import sys
import os
import pandas
import numpy
import json
import re
from subprocess import call
from project import configs
import bioservices

# print root path of the project
print(configs.rootdir) 

Creating directory /home/jupyteruser/.config/bioservices 
/home/jupyteruser/work


## Step 0: Preprocess Bulk RNA-seq data 

Bulk RNA-seq data can be given as a count matrix where each column is a different sample/replicate named 'tissuename_SXRYrZ', where X is the sample or study number, Y is the replicate number, and Z is the run number. If the replicate does not contain multiple runs, the rZ suffix can be omitted. Replicates should come from the same study/sample group, and different samples can come from different studies as long as the tissue/cell was under similar enough conditions for your model.
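The naming convention above can be checked mechanically. The following is a minimal sketch, assuming the study, replicate, and run numbers are plain integers and that the tissue name may itself contain underscores (as in `liver_control`). The names `SampleName` and `parse_sample_name` are hypothetical helpers, not MADRID functions.

```python
import re
from typing import NamedTuple, Optional

# tissuename_SXRY with an optional rZ run suffix
SAMPLE_RE = re.compile(
    r"^(?P<tissue>.+)_S(?P<study>\d+)R(?P<replicate>\d+)(?:r(?P<run>\d+))?$"
)


class SampleName(NamedTuple):
    tissue: str
    study: int
    replicate: int
    run: Optional[int]  # None when the replicate has a single run


def parse_sample_name(name):
    """Split a column name such as 'liver_control_S6R2r1' into its parts."""
    match = SAMPLE_RE.match(name)
    if match is None:
        raise ValueError(f"not a tissuename_SXRYrZ sample name: {name!r}")
    run = match.group("run")
    return SampleName(
        tissue=match.group("tissue"),
        study=int(match.group("study")),
        replicate=int(match.group("replicate")),
        run=int(run) if run is not None else None,
    )


print(parse_sample_name("liver_control_S6R2r1"))
print(parse_sample_name("liver_control_S3R2"))
```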

If you wish to use raw .fastq data for your bulk RNA-seq inputs, you can align them with STAR using the `--quantMode GeneCounts` option and rename the .tab outputs to match the column names described above. Place the .tab files into a folder called SX, where X is the unique study number for the tissue matching the filename. Place each study folder into a folder named after the tissue for the model you are building. Place the tissue folder into `/work/data/STAR_output`. An example of this file structure can be found in the STAR_output folder. If using STAR output, be sure that the `-c` argument is 'TRUE'.
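The folder layout described above (tissue folder, then study folder, then renamed .tab file) can be sketched as a path-building helper. This is only an illustration: `star_tab_path` is a hypothetical name, and the root folder is passed in as an argument. The example uses `/work/data/STAR_output` because the preprocessing log further down reports a `STAR_output` input directory.

```python
from pathlib import PurePosixPath


def star_tab_path(root, tissue, study, replicate, run=None):
    """Build <root>/<tissue>/S<study>/<tissue>_S<study>R<replicate>[r<run>].tab."""
    sample = f"{tissue}_S{study}R{replicate}"
    if run is not None:
        sample += f"r{run}"  # run suffix only when a replicate has several runs
    return PurePosixPath(root) / tissue / f"S{study}" / f"{sample}.tab"


print(star_tab_path("/work/data/STAR_output", "liver_control", 6, 2, run=1))
print(star_tab_path("/work/data/STAR_output", "liver_control", 3, 2))
```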

Currently, MADRID can filter raw RNA-seq counts using either a flat cutoff on CPM (counts per million) normalized values or the recommended 'quantile' technique, which normalizes using TPM (transcripts per million) and filters using an upper quantile. Future versions will also allow for the zFPKM method outlined in this paper: https://pubmed.ncbi.nlm.nih.gov/24215113/
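To make the two normalizations concrete, here is a minimal pure-Python sketch of CPM, TPM, and an upper-quantile filter for a single sample. It is not MADRID's implementation: the gene lengths in base pairs, the nearest-rank percentile, and the "greater than or equal to the cutoff" rule are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math


def cpm(counts):
    """Counts per million: scale each count by the sample's total count."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    scale = sum(rate.values())
    return {gene: r / scale * 1e6 for gene, r in rate.items()}


def percentile(values, q):
    """Nearest-rank percentile (an assumption; other definitions exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[min(rank, len(ordered)) - 1]


def filter_upper_quantile(expression, q):
    """Keep genes whose value is at or above the q-th percentile."""
    cutoff = percentile(expression.values(), q)
    return {gene for gene, value in expression.items() if value >= cutoff}


# Illustrative counts and gene lengths for one sample.
counts = {"a": 100, "b": 300, "c": 600}
lengths = {"a": 1000, "b": 1000, "c": 2000}
print(cpm(counts))
print(filter_upper_quantile(tpm(counts, lengths), 80))
```

In this toy sample, gene `c` has twice the counts of gene `b` but is also twice as long, so the two end up with the same TPM. That length correction is the main difference from CPM.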

Preprocessing will fetch relevant gene information needed for normalization, such as the start and end positions, so be sure to supply either 'cpm' or 'quantile' as the `-t` argument in preprocessing, and make sure it is the same as the one used in `bulk_gen.py` in Step 1.



In [96]:
# Step 0: Preprocess bulk RNA-seq data by generating a count matrix from the gene count files
# produced by STAR and/or fetching necessary gene info from BioDBnet

technique = "quantile"      # technique for bulk RNA-seq active gene determination;
                            # for count matrix generation, only used to determine whether
                            # Picard output mean fragment sizes are required.

tissue_names = "['liver_control']"
create_counts_matrix = True # set to False if using a pregenerated matrix file
gene_format = "Ensembl"     # accepts 'Entrez', 'Ensembl', and 'Symbol'
taxon_id = 'mouse'
    
cmd = ' '.join(['python3', 'bulkRNAPreprocess.py',
                '-n', '"{}"'.format(tissue_names),
                '-c', '"{}"'.format(create_counts_matrix),
                '-f', '"{}"'.format(gene_format),
                '-i', '"{}"'.format(taxon_id),
                '-t', '"{}"'.format(technique)])
!{cmd}

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
['bulkRNAPreprocess.py', '-n', "['liver_control']", '-c', 'True', '-f', 'Ensembl', '-i', 'mouse', '-t', 'quantile']
liver_control
Input directory is "/home/jupyteruser/work/data/STAR_output/liver_control"
Gene info output directory is "/home/jupyteruser/work/data/results/liver_control"
Active gene determination technique is "quantile"
Creating Counts Matrix
[1] "Organizing Files"
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] "Creating counts matrix"
[1] "colname"
[1] "liver_control_S3R2"
[1] "colname"
[1] "liver_control_S3R3"
[1] "colname"
[1] "liver_control_S4R3"
[1] "colname"
[1] "liver_control_S4R4"
[1] "colname"
[1] "liver_control_S4R5"
[1] "colname"
[1] "liver_control_S4R6"
[1] "colname"
[1] "liver_control_S5R3"
[1] "pre-split"
[1] "liver_control_S6R2r1"
[1] "post-split"
[1] "liver_control_S6R2"
[1] "colname"
[1] "liver_control_S6R2"
[1] "pre-s

## Step 1: Identifying gene activity by analyzing transcriptomics and proteomics datasets

*** Specify input files for step 1 here ***

Not all three data types are required for model generation. Skip any data source that is not being used for your model.

In [97]:
# Specify input files for step 1

# config file for microarray
microarray_config_file = 'microarray_data_inputs.xlsx'

# config for bulk rna-seq
bulk_config_file = 'bulk_data_inputs_mouse.xlsx'

# config file for proteomics
proteomics_config_file = 'proteomics_data_inputs.xlsx'

# ratio of replicates required for a gene to be considered active in that sample
expression_proportion = 0.5

# Genes can be considered high confidence (labeled as 'top') if they are expressed
# in a high proportion of samples. High confidence genes will be considered expressed
# regardless of agreement with other data sources
top_proportion = 0.9

In [93]:
# Step 1.1 Download and analyze microarray
cmd = ' '.join(['python3', 'microarray_gen.py', 
      '-i', '"{}"'.format(microarray_config_file),
      '-e', '"{}"'.format(expression_proportion),
      '-t', '"{}"'.format(top_proportion)])
!{cmd}

^C
From cffi callback <function _processevents at 0x7f6b11e1b550>:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/rpy2/rinterface_lib/callbacks.py", line 277, in _processevents
    try:
  File "/usr/local/lib/python3.8/dist-packages/rpy2/rinterface.py", line 87, in _sigint_handler
    raise KeyboardInterrupt()
KeyboardInterrupt


In [252]:
# step 1.2 Analyze Bulk-RNA-seq 

exp_prop_rep = 0.5     # proportion of replicates with expression required for a gene to be active in a sample
exp_prop_samp = 0.9    # proportion of samples with expression required for a gene to be active
top_prop_rep = 0.5     # proportion of replicates with expression required for high confidence
top_prop_samp = 0.9    # proportion of samples with expression required for high confidence
technique = "quantile" # filtering technique for active gene determination
quantile = 80          # cutoff TPM percentile for quantile filtering

cmd = ' '.join(['python3', 'bulk_gen.py',   
      '-c', '"{}"'.format(bulk_config_file), 
      '-r', '"{}"'.format(exp_prop_rep),   
      '-s', '"{}"'.format(exp_prop_samp),        
      '-x', '"{}"'.format(top_prop_rep),    
      '-y', '"{}"'.format(top_prop_samp),   
      '-t', '"{}"'.format(technique),        
      '-q', '"{}"'.format(quantile)])       
                
!{cmd}

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Config file is "bulk_data_inputs_mouse.xlsx"
Input count matrix is at "/home/jupyteruser/work/data/data_matrices/liver_control/BulkRNAseqDataMatrix_liver_control.csv"
Gene info file is at "/home/jupyteruser/work/data/results/liver_control/GeneInfo_liver_control.csv"
[1] "Reading Counts Matrix"
[1] "cmat"
[1] "/home/jupyteruser/work/data/data_matrices/liver_control/BulkRNAseqDataMatrix_liver_control.csv"
[1] "config"
[1] "/home/jupyteruser/work/data/config_sheets/bulk_data_inputs_mouse.xlsx"
[1] "info"
[1] "/home/jupyteruser/work/data/results/liver_control/GeneInfo_liver_control.csv"
[1] "model"
[1] "liver_control"
[1] "Filtering Counts"
Test data saved to /home/jupyteruser/work/data/results/liver_control/Bulk_liver_control.csv
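The replicate-level and sample-level thresholds set in the Step 1.2 cell above can be read as a two-stage vote. The sketch below is one plausible reading of those parameters, not the code in `bulk_gen.py`: a gene counts as expressed in a sample when a large enough fraction of that sample's replicates pass the filter, and it is called active (or high confidence) when a large enough fraction of samples express it. The use of `>=` and the helper names are assumptions.

```python
def fraction_true(flags):
    """Fraction of truthy entries in a non-empty iterable."""
    flags = list(flags)
    return sum(bool(f) for f in flags) / len(flags)


def classify_gene(expressed, exp_prop_rep=0.5, exp_prop_samp=0.9,
                  top_prop_rep=0.5, top_prop_samp=0.9):
    """Return (active, top) for one gene.

    expressed maps each sample to a list of per-replicate booleans that say
    whether the gene passed the CPM or quantile filter in that replicate.
    """
    rep_fractions = [fraction_true(reps) for reps in expressed.values()]
    active_fraction = fraction_true(f >= exp_prop_rep for f in rep_fractions)
    top_fraction = fraction_true(f >= top_prop_rep for f in rep_fractions)
    return active_fraction >= exp_prop_samp, top_fraction >= top_prop_samp


# Illustrative gene: expressed in S3 and S5, but in only 1 of 4 replicates of S4.
gene_flags = {"S3": [True, True], "S4": [True, False, False, False], "S5": [True]}
print(classify_gene(gene_flags, exp_prop_samp=0.5))
```

With `exp_prop_samp=0.5`, two of the three samples are enough to call the gene active, while the default `top_prop_samp=0.9` keeps it out of the high-confidence set.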


In [6]:
# Step 1.3 Analyze proteomics
quantile = 25

cmd = ' '.join(['python3', 'proteomics_gen.py', 
      '-c', '"{}"'.format(proteomics_config_file),
      '-e', '"{}"'.format(expression_proportion),
      '-t', '"{}"'.format(top_proportion),
      '-p', '"{}"'.format(quantile)])
!{cmd}

Config file is at "/home/jupyteruser/work/data/config_sheets/proteomics_data_inputs.xlsx"
Data matrix is at "/home/jupyteruser/work/data/data_matrices/Naive/ProteomicsDataMatrix_Naive.csv"
Test Data Saved to /home/jupyteruser/work/data/results/Naive/Proteomics_Naive.csv


In [253]:
# Step 1.4 Merge the gene lists of transcriptomics and proteomics, create a list of active gene IDs

expression_requirement=1 # number of data sources with expression required for a gene
                         # to be considered active if not a top gene for any source
                         # (defaults to the total number of input data sources)

cmd = ' '.join(['python3', 'merge_xomics.py', 
      ##'-t', '"{}"'.format(microarray_config_file),
      '-b', '"{}"'.format(bulk_config_file),
      #'-p', '"{}"'.format(proteomics_config_file),
      '-r', '"{}"'.format(expression_requirement)])
!{cmd}

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Microarray file is "None"
Proteomics file is "None"
Bulk RNA-seq file is "bulk_data_inputs_mouse.xlsx"
Read from /home/jupyteruser/work/data/results/liver_control/Bulk_liver_control.csv
83 single ENTREZ_GENE_IDs to merge
id_list: 443, set: 394
entrez_single_id_list: 26277, set: 26240
entrez_id_list: 161, set: 161
dups: 79, set: 30
136 id merged
liver_control: save to /home/jupyteruser/work/data/results/liver_control/GeneExpression_liver_control_Merged.csv
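The merge rule described in the Step 1 comments (a gene is active if it is a 'top' gene in any source, or if it is expressed in at least `expression_requirement` sources) can be sketched as follows. This is a simplified illustration rather than the logic of `merge_xomics.py`. In particular, treating a gene that is missing from a source as not expressed there is an assumption, and `merge_sources` is a hypothetical name.

```python
def merge_sources(calls, expression_requirement=None):
    """Return the set of active genes across data sources.

    calls maps a source name to {gene: (expressed, top)}.
    The requirement defaults to the number of input sources.
    """
    if expression_requirement is None:
        expression_requirement = len(calls)
    genes = set().union(*(per_gene.keys() for per_gene in calls.values()))
    active = set()
    for gene in genes:
        # Assumption: a gene absent from a source is not expressed there.
        flags = [per_gene.get(gene, (False, False)) for per_gene in calls.values()]
        n_expressed = sum(bool(expressed) for expressed, _ in flags)
        is_top = any(top for _, top in flags)
        if is_top or n_expressed >= expression_requirement:
            active.add(gene)
    return active


# Illustrative calls from two sources.
example_calls = {
    "bulk": {"g1": (True, False), "g2": (True, True), "g3": (False, False)},
    "proteomics": {"g1": (True, False), "g2": (False, False), "g3": (True, False)},
}
print(sorted(merge_sources(example_calls)))
print(sorted(merge_sources(example_calls, expression_requirement=1)))
```

Here `g2` is kept under the default requirement only because it is a top gene in the bulk data, while `g3` needs the requirement lowered to 1.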



## Step 2: Create tissue-specific or cell-type-specific Models

In [254]:
# Load the output of step 1, which is a dictionary that specifies the merged list of active Gene IDs for each tissue

step1_results_file = os.path.join(configs.rootdir, 'data', 'results', 'step1_results_files.json')
with open(step1_results_file) as json_file:
    tissue_gene_exp = json.load(json_file)
print(tissue_gene_exp)

{'liver_control': '/home/jupyteruser/work/data/results/liver_control/GeneExpression_liver_control_Merged.csv'}


*** Specify input files for step 2 here ***

In [255]:
# (input) filename of General Model, Recon3D_Teff_ver2
GeneralModelFile = 'iMM1865_madrid.mat'
#GeneralModelFile = 'GeneralModel.mat'
excludeRxns = os.path.join(configs.datadir, 'inconsistant_rxns.csv') # flux-inconsistent reactions to remove from the core reactions in FASTCORE
forceRxns = os.path.join(configs.datadir, 'lit_core_rxns.csv') 
reconAlgorithm = 'FASTCORE' # troppo reconstruction algorithm to use
#objective = 'biomass_reaction_Mphage'
objective = 'BIOMASS_reaction'

In [None]:
# create tissue specific model, the names of output files are stored in dictionary tissue_spec_model
tissue_spec_model = {}

for key,value in tissue_gene_exp.items():
    tissuefile = '{}_SpecificModel.mat'.format(key) # key is the tissue name
    tissue_spec_model[key] = tissuefile
    tissue_gene_file = re.split('/|\\\\', value)[-1]
    #tissue_gene_folder = os.path.join(configs.rootdir, 'data', key)
    #os.makedirs(tissue_gene_folder, exist_ok=True)
    cmd = ' '.join(['python3', 'create_tissue_specific_model.py', 
                      '-t', '"{}"'.format(key),
                      '-m', '"{}"'.format(GeneralModelFile), 
                      '-g', '"{}"'.format(tissue_gene_file),
                      '-o', '"{}"'.format(tissuefile),
                      '-s', '"{}"'.format(objective),
                      '-x', '"{}"'.format(excludeRxns),
                      #'-f', '"{}"'.format(forceRxns),
                      '-a', '"{}"'.format(reconAlgorithm)])
    !{cmd}

print(tissue_spec_model)

Tissue Name is "liver_control"
General Model file is "iMM1865_madrid.mat"
Gene Expression file is "GeneExpression_liver_control_Merged.csv"
Output file is "liver_control_SpecificModel.mat"
Using "FASTCORE" reconstruction algorithm
Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  1.000e+00  ratio =  1.000e+00
Problem data seem to be well scaled
Map gene expression to reactions, 0 errors.
J size2806
[   30    31    32 ... 10608 10609 10611]
before LP7
LP7
Could not set parameters with this solver
-0.2805999999999854
done LP7
LP9
Could not set parameters with this solver
149974.14121764514
done LP9
197 5250
before LP7
LP7
Could not set parameters with this solver
-0.017900000000451777
done LP7
LP9
Could not set parameters with this solver


## Step 3: Identifying disease related genes by analyzing transcriptomics data of patients
Differential Expression Analysis

In the config_sheets folder, there should be a folder called "disease". You can add a spreadsheet for each cell/tissue type called `disease_data_inputs_<tissue_name>`. Each sheet of this file should correspond to a separate disease to analyze using differential gene expression (DGE) for that tissue. The source data can be either microarray or bulk RNA-seq and is formatted the same way as when creating the base tissue model. The sheet names should contain the disease name, an underscore, and then either "microarray" or "bulk" depending on the source data. For example, if the disease is lupus and the source data is bulk RNA-seq, the sheet should be named "lupus_bulk". This can be seen in the example sheet. If using bulk RNA-seq data, there should be a count matrix file in `/work/data/data_matrices/<tissue_name>/disease/` called `BulkRNAseqDataMatrix_<disease name>_<tissue name>`.
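The sheet-naming rule and the expected location of the count matrix can be expressed as two small helpers. This is a sketch under stated assumptions: the `.csv` extension is taken from the run log further down, the data root is passed in as an argument, and both function names are hypothetical.

```python
from pathlib import PurePosixPath

VALID_SOURCES = ("microarray", "bulk")


def parse_disease_sheet(sheet_name):
    """Split '<disease>_<source>' into its two parts, e.g. 'lupus_bulk'."""
    disease, sep, source = sheet_name.rpartition("_")
    if not sep or not disease or source not in VALID_SOURCES:
        raise ValueError(f"expected '<disease>_microarray' or '<disease>_bulk': {sheet_name!r}")
    return disease, source


def disease_matrix_path(data_root, tissue, disease):
    """Expected bulk RNA-seq count matrix for one disease and tissue."""
    filename = f"BulkRNAseqDataMatrix_{disease}_{tissue}.csv"
    return PurePosixPath(data_root) / "data_matrices" / tissue / "disease" / filename


print(parse_disease_sheet("lupus_bulk"))
print(disease_matrix_path("/work/data", "Naive", "lupus"))
```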

*** Specify input files for step 3 here ***

In [13]:
# specify tissue names to perform a disease analysis on. The diseases to analyze should be
# specified in `/work/data/config_sheets/disease/disease_data_inputs_<tissue name>`
tissue_names = ['Naive']

In [14]:
# Differential gene expression analysis
for tissue_name in tissue_names:
    disease_config_file = "".join(["disease_data_inputs_", tissue_name, ".xlsx"])
    cmd = ' '.join(['python3', 'disease_analysis.py',
                  '-t', '"{}"'.format(tissue_name),
                  '-c', '"{}"'.format(disease_config_file)])
    !{cmd}

Config file is at  /home/jupyteruser/work/data/config_sheets/disease/disease_data_inputs_Naive.xlsx
Count Matrix File is at  /home/jupyteruser/work/data/data_matrices/Naive/disease/BulkRNAseqDataMatrix_lupus_Naive.csv
[1] "Reading Counts Matrix"
[1] "Performing DGE"
Traceback (most recent call last):
  File "disease_analysis.py", line 186, in <module>
    main(sys.argv[1:])
  File "disease_analysis.py", line 120, in main
    data2 = DGEio.DGE_main(count_matrix_path, inqueryFullPath, tissue_name, disease_name)
  File "/usr/local/lib/python3.8/dist-packages/rpy2/robjects/functions.py", line 198, in __call__
    return (super(SignatureTranslatedFunction, self)
  File "/usr/local/lib/python3.8/dist-packages/rpy2/robjects/functions.py", line 125, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/rpy2/rinterface_lib/conversion.py", line 45, in _
    cdata = function(*args, **kwargs)
  File "/usr/local/lib/python3.8/di

## Step 4: Identification of drug targets and repurposable drugs
This step maps drug targets onto metabolic models, performs knock-out simulations, compares the simulation results with disease genes, and identifies drug targets and repurposable drugs.

*** Specify input files for step 4 here ***

1. A processed drug-target file is included in `/root/pipelines/data/`. (Optional) To use an updated version, download `Repurposing_Hub_export.txt` from the [Drug Repurposing Hub](https://clue.io/repurposing-app). First remove all activators, agonists, and withdrawn drugs from the downloaded file, then upload it to `/root/pipelines/data/`.

2. To use the automatically created tissue-specific models, keep `tissue_spec_model` as generated in Step 2 (shown in the following cell). Note: It is recommended to use refined and validated models for further analysis. Users can define customized models in the next substep.

In [15]:
# tissue specific models
tissue_spec_model 

{'Naive': 'Naive_SpecificModel.mat'}

3. To use customized models, specify `tissue_spec_model` manually, e.g., by uncommenting `tissue_spec_model` in the following cell.

In [16]:
# Manually specify up- and down-regulated genes for the disease. (Please upload manually created files to `/pipelines/data/`. Use the filenames given below or change them accordingly.)
# Disease_Down = 'Disease_DOWN.txt'
# Disease_Up = 'Disease_UP.txt'
# drug_raw_file = 'Repurposing_Hub_export.txt'

# Manually specify tissue-specific models fine-tuned by the user. Change the file names accordingly. Users can use a single model or multiple models here. Using multiple models increases simulation time.
# tissue_spec_model = {'Th1':'Th1Model.mat',
#                      'Th2':'Th2Model.mat',
#                      'Th17':'Th17Model.mat',
#                      'Naive':'NaiveModel.mat'}

# Manually specify tissue-specific models created with the MATLAB COBRA Toolbox. For the example run, we have provided four models of CD4+ T cells (naive, Th1, Th2, and Th17); uncomment all of them or any specific model
# tissue_spec_model = {'Th1':'Th1_SpecificModel_matlab.mat',
#                      'Th2':'Th2_SpecificModel_matlab.mat',
#                      'Th17':'Th17_SpecificModel_matlab.mat',
#                      'Naive':'Naive_SpecificModel_matlab.mat'}
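
The optional clean-up of `Repurposing_Hub_export.txt` mentioned in item 1 above (removing activators, agonists, and withdrawn drugs) could be scripted roughly as below. This is a sketch, not part of MADRID: it assumes a tab-delimited file with a mechanism-of-action column named `MOA` and a clinical-phase column named `Phase`, so check the header of the actual export and adjust the column names. The drug names in the example are made up. A word-boundary pattern is used so that "antagonist" is not mistaken for "agonist".

```python
import csv
import io
import re

# Matches 'activator(s)' or 'agonist(s)' as whole words only.
EXCLUDED_MOA = re.compile(r"\b(activator|agonist)s?\b", re.IGNORECASE)


def keep_drug(row, moa_col="MOA", phase_col="Phase"):
    """True if the row is neither an activator/agonist nor a withdrawn drug."""
    moa = row.get(moa_col) or ""
    phase = (row.get(phase_col) or "").strip().lower()
    return EXCLUDED_MOA.search(moa) is None and phase != "withdrawn"


def filter_drug_table(text, delimiter="\t"):
    """Parse delimited text and return only the rows that pass keep_drug."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader if keep_drug(row)]


# Made-up rows for illustration; column names are assumptions.
example = (
    "Name\tMOA\tPhase\n"
    "drugA\tkinase inhibitor\tLaunched\n"
    "drugB\tdopamine receptor agonist\tLaunched\n"
    "drugC\tdopamine receptor antagonist\tLaunched\n"
    "drugD\tkinase inhibitor\tWithdrawn\n"
    "drugE\tpotassium channel activator\tPhase 2\n"
)
print([row["Name"] for row in filter_drug_table(example)])
```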


In [17]:
# Knock out simulation for the analyzed tissues and diseases
diseases = ['lupus', 'arthritis']
for key,value in tissue_spec_model.items():
    for dis in diseases:
        # load the results of step 3 to dictionary 'disease_files'
        step3_results_file = os.path.join(configs.datadir, 'results', key, 
                                          dis, 'step2_results_files.json')
        with open(step3_results_file) as json_file:
            disease_files = json.load(json_file)
        #print(disease_files)
        Disease_Down = disease_files['DN_Reg']
        Disease_Up = disease_files['UP_Reg']
        drug_raw_file = 'Repurposing_Hub_export.txt'
        
        out_dir = os.path.join(configs.datadir, "results", key, dis)
        tissueSpecificModelfile  = os.path.join(configs.datadir, "results", key, value)
        print(tissueSpecificModelfile)
        tissue_gene_folder = os.path.join(configs.datadir, key)
        os.makedirs(tissue_gene_folder, exist_ok=True)
        inhibitors_file = '{}_inhibitors_Entrez.txt'.format(key)
        cmd = ' '.join(['python3' , 'knock_out_simulation.py',
                      '-t', tissueSpecificModelfile,
                      '-i', inhibitors_file,
                      '-u', Disease_Up,
                      '-d', Disease_Down,
                      '-f', out_dir,
                      '-r', drug_raw_file])
        !{cmd}

        # copy generated output to output folder
        cmd = ' '.join(['cp', '-a', os.path.join(configs.datadir, key), configs.outputdir])
        !{cmd}
        #break


FileNotFoundError: [Errno 2] No such file or directory: '/home/jupyteruser/work/data/results/Naive/lupus/step2_results_files.json'
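
The error above is raised when the per-disease results file that the knock-out cell tries to open does not exist, for example because the disease analysis for that tissue did not finish. A small pre-check, sketched below, can list the missing files before the loop starts. The helper name is hypothetical, and the default file name is simply the one the cell above opens.

```python
from pathlib import Path


def missing_disease_results(data_dir, tissues, diseases,
                            filename="step2_results_files.json"):
    """List (tissue, disease, path) for every expected results file that is absent."""
    missing = []
    for tissue in tissues:
        for disease in diseases:
            path = Path(data_dir) / "results" / tissue / disease / filename
            if not path.is_file():
                missing.append((tissue, disease, path))
    return missing


# Example call with a placeholder data directory; every pair is reported if it does not exist.
for tissue, disease, path in missing_disease_results(
        "/path/to/data", ["Naive"], ["lupus", "arthritis"]):
    print(f"missing results for {tissue}/{disease}: {path}")
```

In the notebook, `configs.datadir`, the keys of `tissue_spec_model`, and the `diseases` list would be the natural arguments, and any reported pair could be skipped or re-run through Step 3 first.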