# MADRID - MetAbolic Drug Repurposing and IDentification 

## Instructions

This JupyterLab notebook runs the MADRID pipeline to identify drug targets and repurposable drugs for user-defined complex human diseases. The entire process contains five steps, listed after the input requirements below.


### Before running, you must provide the proper files depending on the types of data you will use for model creation. 
- For RNA-seq: either a correctly formatted folder named "MADRID_input" in the data directory, or a premade gene counts matrix (see Step 1). Proper inputs can be generated using our Snakemake pipeline designed specifically for MADRID: https://github.com/HelikarLab/FastqToGeneCounts. RNA-seq data can be either single-cell or bulk, but the provided Snakemake pipeline currently supports bulk only. If you process your own single-cell data for use with MADRID, or use an alternative pipeline to generate gene counts, follow the instructions in Step 1.

- For proteomics: a matrix of measurements where rows are proteins in Entrez format and columns are arbitrary sample names.

- For microarray: results must be uploaded to the Gene Expression Omnibus (GEO). The only input MADRID needs is a configuration file with GSE, GSM, and GPL codes; `microarray_data_inputs.xlsx` shows a template to use. Note that microarray is largely obsolete, and RNA-seq is recommended where possible.

### Five steps to identify drug targets (stop after step 3 if building a context-specific model for other purposes)
        
1. Preprocess bulk RNA-seq data by converting the gene count files output by STAR into a unified matrix and fetching the information about each gene that is required for normalization. Also generate a configuration sheet if you are not making one manually.

2. Analyze microarray, bulk RNA-seq, and proteomics data, and output a list of active genes.

3. Create tissue-specific models based on the list of active genes. If required, the user can manually refine these models and supply them in Step 4.

4. Identify differentially expressed genes from disease datasets using either microarray or bulk RNA-seq transcriptomics information.

5. Identify drug targets and repurposable drugs. This step consists of four substeps.
    - mapping drugs on automatically created or user-supplied models
    - knock-out simulation
    - compare simulation results of perturbed and unperturbed models
    - integrate with disease genes and score drug targets.

### Configuration sheet information
The user should upload Excel configuration sheets to `/work/data/config_sheets` in the Docker container. The sheet names in these config files should correspond to different models, and each sheet should contain a list of the samples to include for that model. These sample names should match the sample names in the source data, which is defined in `/work/data/data_matrices/<model name>/`.

The original Docker image includes example input files for building metabolic models of naive, Th1, Th2, and Th17 subtypes and identifying drug targets for rheumatoid arthritis. Follow the documentation and use the example input files as templates to create your own input files.

In [181]:
# import necessary python packages
import sys
import os
import pandas
import numpy
import json
import re
from subprocess import call
from project import configs
import bioservices
import pprint

## Step 1: Initialize and Preprocess RNA-seq data 

Bulk RNA-seq data can be given as a count matrix where each column is a different sample/replicate named 'exampleTissueName_SXRYrZ', where X is the study (or batch) number, Y is the replicate number, and Z is the run number. If the replicate does not contain multiple runs, the rZ suffix should be omitted. Replicates should come from the same study/sample group, and different samples can come from different published studies (or batches) as long as the tissue/cell was under conditions similar enough for your modeling purpose.
<br>
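
The naming convention can also be checked programmatically. Below is a minimal sketch, not part of MADRID itself: the helper `parse_sample_name` and its regular expression are assumptions derived from the `exampleTissueName_SXRYrZ` pattern described above.

```python
import re

# Assumed pattern: <tissue>_S<study>R<replicate>, optionally followed by r<run>
SAMPLE_RE = re.compile(
    r"^(?P<tissue>.+)_S(?P<study>\d+)R(?P<replicate>\d+)(?:r(?P<run>\d+))?$"
)

def parse_sample_name(name):
    """Split a sample name into tissue, study, replicate, and optional run."""
    match = SAMPLE_RE.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the tissue_SXRY[rZ] convention")
    parts = match.groupdict()
    return {
        "tissue": parts["tissue"],
        "study": int(parts["study"]),
        "replicate": int(parts["replicate"]),
        # run is None when the replicate has a single run and no rZ suffix
        "run": int(parts["run"]) if parts["run"] is not None else None,
    }

print(parse_sample_name("exampleTissue_S3R4r2"))
print(parse_sample_name("exampleTissue_S1R1"))
```

Names that do not match raise a `ValueError`, which makes it easy to catch misnamed files before building a matrix.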

There are 2 ways to initiate MADRID:
<br>

1. Provide a properly formatted counts matrix in `/work/data/data_matrices/exampleTissue/gene_counts_matrix_exampleTissue.csv` where the columns are named exampleTissue_SXRY (note the lack of a run number, since runs should be summed into a single set of counts). If you provide the count matrix this way instead of generating one using method 2, you will also have to create a configuration file by hand that lists each sample's name, study/batch number, and, if using zFPKM, layout and mean fragment length. Use the provided template to help. **This method is best if you are downloading a premade count matrix, or using single-cell data that has already been batch corrected, clustered, and sorted into only the cell type of interest!**
<br>
 
2. Provide a properly formatted `MADRID_inputs` folder in the data directory. 
<br>
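
Method 1 expects run-level counts to be summed into one column per replicate. The following is a rough sketch of that collapsing step using plain dictionaries; the helper `sum_runs`, the gene names, and the counts are all hypothetical, and this is not MADRID's own implementation.

```python
import re
from collections import defaultdict

def sum_runs(counts_by_sample):
    """Collapse columns named tissue_SXRYrZ into tissue_SXRY by summing per-gene counts."""
    merged = defaultdict(lambda: defaultdict(int))
    for sample, gene_counts in counts_by_sample.items():
        # Drop a trailing run suffix (rZ) if present; names without one are unchanged
        replicate = re.sub(r"(_S\d+R\d+)r\d+$", r"\1", sample)
        for gene, count in gene_counts.items():
            merged[replicate][gene] += count
    return {replicate: dict(genes) for replicate, genes in merged.items()}

# Toy input: replicate R3 was sequenced in two runs, replicate R1 in one
counts = {
    "exampleTissue_S3R3r1": {"geneA": 10, "geneB": 0},
    "exampleTissue_S3R3r2": {"geneA": 5, "geneB": 2},
    "exampleTissue_S3R1": {"geneA": 7, "geneB": 1},
}
summed = sum_runs(counts)
print(summed)
```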

> **It is recommended that you use our Snakemake pipeline to align reads and create a properly formatted MADRID_inputs folder.** The pipeline also runs a series of important quality control methods to help determine whether any of the provided samples are unsuitable for model creation. This pipeline can be found at https://github.com/HelikarLab/FastqToGeneCounts.

> Use the following guide to create a `MADRID_inputs` folder from your own protocol.

> - The top level of the directory has separate folders for the tissues/cells to create separate models from. The next level must have a folder called `geneCounts` and, optionally, a `strandedness` folder. If using zFPKM normalization, there should also be two more folders called `layouts` and `fragmentSizes`. Inside each of these folders should be folders named SX, where X is an arbitrary user-defined number that the replicates are associated with.
<br>    

> - Inside these study number folders of `geneCounts` should be the outputs of the STAR aligner run with `--quantMode GeneCounts`. To help MADRID (and you!) stay organized, these outputs should be renamed `exampleTissueName_SXRYrZ.tab`, where X is the study (or batch) number, Y is the replicate number, and Z is the run number. If the replicate does not contain multiple runs, the rZ suffix should be omitted. Replicates should come from the same study/sample group, and different samples can come from different published studies (or batches) as long as the tissue/cell was under conditions similar enough for your modeling purpose.
<br>

> - Inside the study number folders of `strandedness` should be files named `exampleTissue_SXRYrZ_strandedness.txt`. These files must state the strandedness of the RNA-seq method used, and must contain one of the following values (and nothing else):
>> * NONE
>> * FIRST_READ_TRANSCRIPTION_STRAND
>> * SECOND_READ_TRANSCRIPTION_STRAND
<br>

> - Inside the study number folders of `layouts` should be files named `exampleTissue_SXRYrZ_layout.txt`, where each file states the layout of the library used and must contain one of the following values (and nothing else):
>> * paired-end
>> * single-end
<br>

> - Inside the study number folders of `fragmentSizes` should be files named `exampleTissue_SXRYrZ_fragment_sizes.txt` that contain the output of RSeQC's `RNA_fragment_size.py` script.
 
 
### The structure below shows an example of how this `MADRID_inputs` should be assembled.

├── STAR_output <br />
&emsp;&emsp;&emsp;└── exampleTissue <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── geneCounts <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S1 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S1R4.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S2 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S2R4.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── S3 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R3r1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R3r2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R3r3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R4r1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R4r2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R4r3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R5r1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R5r2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R5r3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R6r1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R6r2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; └── exampleTissue_S3R6r3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── strandedness <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S1 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├──  exampleTissue_S1R1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S1R4_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S2 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S2R4_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── S3 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;└── exampleTissue_S3R6r3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── layouts <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S1 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S1R4_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S2 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S2R4_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── S3 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;└── exampleTissue_S3R6r3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;└── fragmentSizes (optional, for zFPKM only) <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;├── S1 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R3_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S1R4_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;├── S2 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R3_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S2R4_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;└── S3  <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r3_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r3_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r3_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;└── exampleTissue_S3R6r3_fragment_size.txt <br />
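
Before running the pipeline, it can help to verify that a hand-built folder matches this layout. The sketch below is an illustration only and not a MADRID utility: the helper `check_tissue_folder` is hypothetical, and it checks just one rule, namely that when a `strandedness` folder exists, every `.tab` file has a matching strandedness file containing one of the three allowed values.

```python
import tempfile
from pathlib import Path

# Allowed file contents, as listed in the strandedness guide above
VALID_STRANDEDNESS = {
    "NONE",
    "FIRST_READ_TRANSCRIPTION_STRAND",
    "SECOND_READ_TRANSCRIPTION_STRAND",
}

def check_tissue_folder(tissue_dir):
    """Return a list of problems found under one tissue folder (hypothetical helper)."""
    tissue_dir = Path(tissue_dir)
    counts_dir = tissue_dir / "geneCounts"
    if not counts_dir.is_dir():
        return [f"missing folder: {counts_dir.name}"]
    problems = []
    strand_dir = tissue_dir / "strandedness"
    for tab_file in sorted(counts_dir.glob("S*/*.tab")):
        if not strand_dir.is_dir():
            continue  # the strandedness folder is optional
        study = tab_file.parent.name
        strand_file = strand_dir / study / f"{tab_file.stem}_strandedness.txt"
        if not strand_file.is_file():
            problems.append(f"missing {strand_file.name}")
        elif strand_file.read_text().strip() not in VALID_STRANDEDNESS:
            problems.append(f"invalid value in {strand_file.name}")
    return problems

# Build a tiny throwaway layout to demonstrate the check
with tempfile.TemporaryDirectory() as tmp:
    tissue = Path(tmp) / "exampleTissue"
    (tissue / "geneCounts" / "S1").mkdir(parents=True)
    (tissue / "strandedness" / "S1").mkdir(parents=True)
    for replicate in ("R1", "R2"):
        (tissue / "geneCounts" / "S1" / f"exampleTissue_S1{replicate}.tab").write_text("")
    (tissue / "strandedness" / "S1" / "exampleTissue_S1R1_strandedness.txt").write_text("NONE\n")
    problems = check_tissue_folder(tissue)

print(problems)
```

The same pattern could be extended to the `layouts` and `fragmentSizes` folders when zFPKM normalization is used.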

Currently, MADRID can filter raw RNA-seq counts using three normalization techniques.

> - TPM Quantile: each replicate is normalized to transcripts per million (TPM) and an upper quantile is taken to create a boolean list of active genes for the replicate. Replicates are compared for consensus within each study/batch according to user-defined ratios, and then studies/batches are checked for consensus according to separate user-defined ratios. **Recommended if the user wants more control over the size of the model, such as a smaller model that allows only the most expressed reactions, or a larger, more encompassing one that also contains less essential reactions.**

> - zFPKM: the method outlined in https://pubmed.ncbi.nlm.nih.gov/24215113/. Counts are normalized using zFPKM, and genes with zFPKM > -3 are considered expressed, per the authors' recommendation. Expressed genes are checked for consensus at the replicate and study/batch levels in the same way as for TPM Quantile. **Recommended if the user wants to give the least input on gene essentiality determination and use the most standardized method of active gene determination.**

> - CPM: a flat cutoff on counts-per-million (CPM) normalized values, with consensus checked in the same way as for the other methods. **Not recommended.**

Regardless of the normalization technique or the RNA-seq input files provided, preprocessing is required to fetch the gene information needed for harmonization and normalization, such as Entrez IDs and start and end positions.
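
To make the TPM Quantile idea concrete, here is a simplified sketch of TPM normalization, a percentile cutoff per replicate, and a consensus vote across replicates. It is an illustration under stated assumptions, not MADRID's implementation: the function names, the nearest-rank percentile convention, the "strictly above the cutoff" rule, and the toy counts and gene lengths are all hypothetical.

```python
import math
from collections import Counter

def tpm(counts, lengths):
    """Transcripts per million for one replicate: length-normalize, then scale to 1e6."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())  # assumes at least one gene has a non-zero count
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

def active_genes(values, percentile):
    """Genes strictly above the nearest-rank percentile of this replicate's values."""
    ordered = sorted(values.values())
    rank = max(0, math.ceil(percentile / 100 * len(ordered)) - 1)
    cutoff = ordered[rank]
    return {gene for gene, value in values.items() if value > cutoff}

def consensus(gene_sets, ratio):
    """Genes present in at least `ratio` of the given sets."""
    tally = Counter(gene for genes in gene_sets for gene in genes)
    return {gene for gene, n in tally.items() if n >= ratio * len(gene_sets)}

# Toy data: two replicates of one study, four genes of equal length
lengths = {"geneA": 1000, "geneB": 1000, "geneC": 1000, "geneD": 1000}
replicates = [
    {"geneA": 0, "geneB": 10, "geneC": 30, "geneD": 60},
    {"geneA": 5, "geneB": 0, "geneC": 45, "geneD": 50},
]
per_replicate = [active_genes(tpm(rep, lengths), percentile=25) for rep in replicates]
lenient = consensus(per_replicate, ratio=0.5)  # active in at least half the replicates
strict = consensus(per_replicate, ratio=1.0)   # active in every replicate
print(sorted(lenient), sorted(strict))
```

Calling `consensus` a second time on the per-study results, with a different ratio, would mirror the study/batch-level check described above.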


In [200]:
# Step 1: Preprocess RNAseq data by generating a counts matrix, config sheet, and gene file from MADRID_inputs.

tissue_names = "['m0Macro', 'm1Macro', 'm2Macro']" # tissue name or list of tissue names within a string
create_counts_matrix = True # set to false if using a pregenerated matrix file
gene_format = "Ensembl"     # accepts 'Entrez', 'Ensembl', and 'Symbol'
taxon_id = "human" # accepts integer (bioDBnet taxon id) or 'human' or 'mouse'
preprocess_mode = "info-matrix-config"
    
cmd = ' '.join(['python3', 'rnaseq_preprocess.py',
                '--tissue-name', '"{}"'.format(tissue_names),
                '--gene-format', '"{}"'.format(gene_format),
                '--taxon-id', '"{}"'.format(taxon_id),
                '--{}'.format(preprocess_mode)])
!{cmd}

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
['rnaseq_preprocess.py', '--tissue-name', "['m0Macro', 'm1Macro', 'm2Macro']", '--gene-format', 'Ensembl', '--taxon-id', 'human', '--info-matrix-config']
Preprocessing m0Macro
Gene info output directory is "/home/jupyteruser/work/data/results/m0Macro"
Looking for STAR gene count tables in '/home/jupyteruser/work/data/MADRID_input/m0Macro'
Creating Counts Matrix for 'm0Macro'
[1] "Organizing Files"
[1] "Creating counts matrix"
Count Matrix written at  /home/jupyteruser/work/data/data_matrices/m0Macro/gene_counts_matrix_m0Macro.csv 
Fetching gene info using genes in '/home/jupyteruser/work/data/data_matrices/m0Macro/gene_counts_matrix_m0Macro.csv'
Ensembl Gene ID
Total Genes to Retrieve: 61541
retrieve 0:300
retrieve 300:600
retrieve 600:900
retrieve 900:1200
retrieve 1200:1500
retrieve 1500:1800
retrieve 1800:2100
retrieve 2100:2400
retri

## Step 2: Identifying Gene Activity in Transcriptomics and Proteomics Datasets

Identify gene activity in the following data sources 
 - RNA-seq (bulk or single-cell)
 - Proteomics
 - Microarray

 Only one source is required for model generation; multiple sources can provide additional validation if they are of high quality.

In [239]:
# Specific input files for step 2

# config file for microarray
microarray_config_file = 'microarray_data_inputs.xlsx'

# config for bulk rna-seq
rnaseq_config_file = 'rnaseq_data_inputs_auto.xlsx'

# config file for proteomics
proteomics_config_file = 'proteomics_data_inputs.xlsx'

# proportion of replicates in which a gene must be expressed to be considered active in that sample
# (named to match the variable used by the microarray and proteomics cells below)
expression_proportion = 0.5

# Genes can be considered high confidence if they are expressed in a high proportion of samples.
# High-confidence genes will be considered expressed regardless of agreement with other data sources.
top_proportion = 0.9

In [145]:
# Step 2.1 Download and Analyze microarray data

cmd = ' '.join(['python3', 'microarray_gen.py', 
      '-i', '"{}"'.format(microarray_config_file),
      '-e', '"{}"'.format(expression_proportion),
      '-t', '"{}"'.format(top_proportion)])
!{cmd}

Input file is  microarray_data_inputs.xlsx
Expression Proportion for Gene Expression is  0.5
Top proportion for high-confidence genes is  0.9
---
Start Collecting Data for:
['GSE22886' 'GSE43005' 'GSE22045' 'GSE24634']
['GSM565273' 'GSM565274' 'GSM565275' 'GSM565290' 'GSM565291' 'GSM565292'
 'GSM1054773' 'GSM1054779' 'GSM1054781' 'GSM1054789' 'GSM548000'
 'GSM548001' 'GSM607510' 'GSM607511' 'GSM607512']
---

Initialize project (GSE22886):
Root: /home/jupyteruser/work
Raw data: /home/jupyteruser/work/data/GSE22886_RAW
Retrieve Sample: /home/jupyteruser/work/data/GSE22886_RAW/GSM565273.tar
Extract to: /home/jupyteruser/work/data/GSE22886_RAW/GPL96
Retrieve Sample: /home/jupyteruser/work/data/GSE22886_RAW/GSM565274.tar
Extract to: /home/jupyteruser/work/data/GSE22886_RAW/GPL96
Retrieve Sample: /home/jupyteruser/work/data/GSE22886_RAW/GSM565275.tar
Extract to: /home/jupyteruser/work/data/GSE22886_RAW/GPL96
Retrieve Sample: /home/jupyteruser/work/data/GSE22886_RAW/GSM565290.tar
Extract to: 

In [217]:
# step 2.2 RNA-seq Analysis

rep_ratio = 0.5          # proportion of replicates with expression required for a gene to be active in a sample
group_ratio = 0.9        # proportion of samples with expression required for a gene to be active
rep_ratio_h = 0.5        # proportion of replicates with expression required for high confidence
group_ratio_h = 0.9      # proportion of samples with expression required for high confidence
technique = "quantile"   # quantile-tpm, cpm, or zfpkm
quantile = 25            # cutoff TPM percentile for quantile filtering

cmd = ' '.join(['python3', 'rnaseq_gen.py', 
      '--config-file', '"{}"'.format(rnaseq_config_file), 
      '--replicate-ratio', '"{}"'.format(rep_ratio),   
      '--group-ratio', '"{}"'.format(group_ratio),        
      '--high-replicate-ratio', '"{}"'.format(rep_ratio_h),    
      '--high-group-ratio', '"{}"'.format(group_ratio_h),   
      '--filt-technique', '"{}"'.format(technique),        
      '--quantile', '"{}"'.format(quantile)])       
                
!{cmd}

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Config file is "rnaseq_data_inputs_auto.xlsx"
model:  m0Macro
Input count matrix is at "/home/jupyteruser/work/data/data_matrices/m0Macro/gene_counts_matrix_m0Macro.csv"
Gene info file is at "/home/jupyteruser/work/data/results/m0Macro/gene_info_m0Macro.csv"
[1] "Reading Counts Matrix"
[1] "/home/jupyteruser/work/data/data_matrices/m0Macro/gene_counts_matrix_m0Macro.csv"
[1] "m0Macro"
[1] "Filtering Counts"
Test data saved to /home/jupyteruser/work/data/results/m0Macro/rnaseq_m0Macro.csv
model:  m1Macro
Input count matrix is at "/home/jupyteruser/work/data/data_matrices/m1Macro/gene_counts_matrix_m1Macro.csv"
Gene info file is at "/home/jupyteruser/work/data/results/m1Macro/gene_info_m1Macro.csv"
[1] "Reading Counts Matrix"
[1] "/home/jupyteruser/work/data/data_matrices/m1Macro/gene_counts_matrix_m1Macro.csv"
[1] "m1Macro"
[1] "Filtering

In [6]:
# Step 2.3 Proteomics Analysis

quantile = 25

cmd = ' '.join(['python3', 'proteomics_gen.py', 
      '-c', '"{}"'.format(proteomics_config_file),
      '-e', '"{}"'.format(expression_proportion),
      '-t', '"{}"'.format(top_proportion),
      '-p', '"{}"'.format(quantile)])
!{cmd}

Config file is at "/home/jupyteruser/work/data/config_sheets/proteomics_data_inputs.xlsx"
Data matrix is at "/home/jupyteruser/work/data/data_matrices/Naive/ProteomicsDataMatrix_Naive.csv"
Test Data Saved to /home/jupyteruser/work/data/results/Naive/Proteomics_Naive.csv


In [251]:
# Step 2.4 Merge the gene lists of data sources, create a list of active gene IDs

expression_requirement = 1 # number of data sources with expression required for a gene
                           # to be considered active if it is not a top gene for any source
                           # (defaults to the total number of input data sources)

cmd = ' '.join(['python3', 'merge_xomics.py', 
      #'--microarray-config-file', '"{}"'.format(microarray_config_file),
      '--rnaseq-config-file', '"{}"'.format(rnaseq_config_file),
      #'--proteomics-config-file', '"{}"'.format(proteomics_config_file),
      '--expression-requirement', '"{}"'.format(expression_requirement)])
!{cmd}

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Bulk RNA-seq file is 'rnaseq_data_inputs_auto.xlsx'
Read from /home/jupyteruser/work/data/results/m0Macro/rnaseq_m0Macro.csv
45 single ENTREZ_GENE_IDs to merge
id_list: 387, set: 356
entrez_single_id_list: 25371, set: 25331
entrez_id_list: 177, set: 176
dups: 52, set: 21
163 id merged
m0Macro: save to /home/jupyteruser/work/data/results/m0Macro/GeneExpression_m0Macro_Merged.csv

Read from /home/jupyteruser/work/data/results/m1Macro/rnaseq_m1Macro.csv
45 single ENTREZ_GENE_IDs to merge
id_list: 387, set: 356
entrez_single_id_list: 25371, set: 25331
entrez_id_list: 177, set: 176
dups: 52, set: 21
163 id merged
m1Macro: save to /home/jupyteruser/work/data/results/m1Macro/GeneExpression_m1Macro_Merged.csv

Read from /home/jupyteruser/work/data/results/m2Macro/rnaseq_m2Macro.csv
45 single ENTREZ_GENE_IDs to merge
id_list: 387, set: 356
entrez
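
The merging rule described in the Step 2 comments (a gene is active if enough sources agree, or if any source marks it as high confidence) can be sketched as follows. This is a simplified illustration, not the logic of `merge_xomics.py`: the function `merge_sources`, the source labels, and the gene IDs are hypothetical.

```python
def merge_sources(expressed, high_confidence, expression_requirement):
    """Combine per-source gene sets into one set of active genes.

    expressed / high_confidence: dict mapping a source name to a set of gene IDs.
    """
    all_genes = set().union(*expressed.values(), *high_confidence.values())
    active = set()
    for gene in all_genes:
        n_sources = sum(gene in genes for genes in expressed.values())
        is_high = any(gene in genes for genes in high_confidence.values())
        # High-confidence genes pass regardless of agreement between sources
        if is_high or n_sources >= expression_requirement:
            active.add(gene)
    return active

# Toy gene sets for two data sources
expressed = {"rnaseq": {"1", "2", "3"}, "proteomics": {"2", "4"}}
high_confidence = {"rnaseq": {"3"}, "proteomics": set()}

both_required = merge_sources(expressed, high_confidence, expression_requirement=2)
one_required = merge_sources(expressed, high_confidence, expression_requirement=1)
print(sorted(both_required), sorted(one_required))
```

With a requirement of 2, gene "2" passes because both sources agree and gene "3" passes only because it is high confidence in one source.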

## Step 3: Create Tissue/Cell-Type Specific Models

In [253]:
# Load the merged output of Step 2 (the results file keeps its "step1" name): a dictionary mapping each tissue to the file holding its merged list of active gene IDs

step1_results_file = os.path.join(configs.rootdir, 'data', 'results', 'step1_results_files.json')
with open(step1_results_file) as json_file:
    tissue_gene_exp = json.load(json_file)
print(tissue_gene_exp)

{'m0Macro': '/home/jupyteruser/work/data/results/m0Macro/GeneExpression_m0Macro_Merged.csv', 'm1Macro': '/home/jupyteruser/work/data/results/m1Macro/GeneExpression_m1Macro_Merged.csv', 'm2Macro': '/home/jupyteruser/work/data/results/m2Macro/GeneExpression_m2Macro_Merged.csv'}


*** Specify input files for step 3 here ***

In [None]:
# create tissue specific model, the names of output files are stored in dictionary tissue_spec_model

general_model_file = "GeneralModel.mat"
#exclude_rxns_file = os.path.join(configs.datadir, "inconsistant_rxns.csv") # flux-inconsistent rxns to remove from core reactions in FASTCORE
#force_rxns_file = os.path.join(configs.datadir, "lit_core_rxns.csv") 
#boundary_rxns_file = os.path.join(configs.datadir, "exchange_rxns.csv")
recon_algorithm = 'FASTCORE' # troppo reconstruction algorithm to use
solver = "GUROBI"
objective = 'biomass_reaction_Mphage'
#objective = 'BIOMASS_reaction'

for key,value in tissue_gene_exp.items():
    tissuefile = '{}_SpecificModel.mat'.format(key) # key is == tissue name
    tissue_gene_file = re.split('/|\\\\', value)[-1]
    #tissue_gene_folder = os.path.join(configs.rootdir, 'data', key)
    #os.makedirs(tissue_gene_folder, exist_ok=True)
    cmd = ' '.join(['python3', 'create_tissue_specific_model.py', 
                      '--tissue-name', '"{}"'.format(key),
                      '--reference-model-file', '"{}"'.format(general_model_file), 
                      '--gene-expression-file', '"{}"'.format(tissue_gene_file),
                      '--objective', '"{}"'.format(objective),
                      #'--boundary-reactions-file', '"{}"'.format(boundary_rxns_file),
                      #'--exclude-reactions-file', '"{}"'.format(exclude_rxns_file),
                      #'--force-reactions-file', '"{}"'.format(force_rxns_file),
                      '--algorithm', '"{}"'.format(recon_algorithm),
                      '--solver', '"{}"'.format(solver)])
    !{cmd}

Tissue Name is "m0Macro"
General Model file is "GeneralModel.mat"
Gene Expression file is "GeneExpression_m0Macro_Merged.csv"
Objective function is "biomass_reaction_Mphage".
Constructing model with "FASTCORE" reconstruction algorithm using "GUROBI" solver
Set parameter WLSAccessID
Set parameter WLSSecret
Set parameter LicenseID
Academic license - for non-commercial use only - registered to bbessell2@unl.edu


## Step 4: Identifying disease-related genes by analyzing patient transcriptomics data
Differential Expression Analysis

In the config_sheets folder, there should be a folder called "disease". Add a spreadsheet for each cell/tissue type called `disease_data_inputs_<tissue_name>`. Each sheet of this file should correspond to a separate disease to analyze with differential gene expression (DGE) for that tissue. The source data can be either microarray or bulk RNA-seq and is formatted the same way as when creating the base tissue model. Each sheet name should contain the disease name, an underscore, and then either "microarray" or "bulk", depending on the source data. For example, if the disease is lupus and the source data is bulk RNA-seq, the sheet should be named "lupus_bulk", as in the example sheet. If using bulk RNA-seq data, there should be a count matrix file in `/work/data/data_matrices/<tissue_name>/disease/` called `BulkRNAseqDataMatrix_<disease name>_<tissue name>`.
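
The sheet-naming rule can be validated with a few lines of Python. This is a small sketch for illustration: the helpers `parse_disease_sheet` and `disease_matrix_filename` are hypothetical, and the `.csv` extension is taken from the file name printed in the example output of this step.

```python
def parse_disease_sheet(sheet_name):
    """Split a sheet name such as 'lupus_bulk' into (disease, source)."""
    disease, sep, source = sheet_name.rpartition("_")
    if not sep or not disease or source not in {"microarray", "bulk"}:
        raise ValueError(
            f"{sheet_name!r} should be '<disease>_microarray' or '<disease>_bulk'"
        )
    return disease, source

def disease_matrix_filename(disease, tissue):
    """Expected count matrix file name for bulk RNA-seq disease data."""
    return f"BulkRNAseqDataMatrix_{disease}_{tissue}.csv"

disease, source = parse_disease_sheet("lupus_bulk")
print(disease, source, disease_matrix_filename(disease, "Naive"))
```

Splitting on the last underscore keeps disease names that themselves contain underscores intact.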

*** Specify input files for step 3 here ***

In [13]:
# Specify tissue names to perform a disease analysis on. The diseases to analyze should be
# specified in `/work/data/config_sheets/disease/disease_data_inputs_<tissue name>`
tissue_names = ['Naive']

In [14]:
# Differential gene expression analysis
for tissue_name in tissue_names:
    disease_config_file = "".join(["disease_data_inputs_", tissue_name, ".xlsx"])
    cmd = ' '.join(['python3', 'disease_analysis.py',
                  '-t', '"{}"'.format(tissue_name),
                  '-c', '"{}"'.format(disease_config_file)])
    !{cmd}

Config file is at  /home/jupyteruser/work/data/config_sheets/disease/disease_data_inputs_Naive.xlsx
Count Matrix File is at  /home/jupyteruser/work/data/data_matrices/Naive/disease/BulkRNAseqDataMatrix_lupus_Naive.csv
[1] "Reading Counts Matrix"
[1] "Performing DGE"
Traceback (most recent call last):
  File "disease_analysis.py", line 186, in <module>
    main(sys.argv[1:])
  File "disease_analysis.py", line 120, in main
    data2 = DGEio.DGE_main(count_matrix_path, inqueryFullPath, tissue_name, disease_name)
  File "/usr/local/lib/python3.8/dist-packages/rpy2/robjects/functions.py", line 198, in __call__
    return (super(SignatureTranslatedFunction, self)
  File "/usr/local/lib/python3.8/dist-packages/rpy2/robjects/functions.py", line 125, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/rpy2/rinterface_lib/conversion.py", line 45, in _
    cdata = function(*args, **kwargs)
  File "/usr/local/lib/python3.8/di

## Step 4: Identification of drug targets and repurposable drugs
This step maps drug targets onto the metabolic models, performs knock-out simulations, compares the simulation results with the disease genes, and identifies drug targets and repurposable drugs.

*** Specify input files for step 4 here ***

1. Instruction: A processed Drug-Target file is included in `/root/pipelines/data/`. (Optional step) To use an updated version, users can download `Repurposing_Hub_export.txt` from the [Drug Repurposing Hub](https://clue.io/repurposing-app). First remove all activators, agonists, and withdrawn drugs from the downloaded file, then upload it to `/root/pipelines/data/`.

2. To use the automatically created tissue-specific models, keep `tissue_spec_model` as shown in the following cell. Note: It is recommended to use refined and validated models for further analysis. Users can define customized models in the next sub-step.

In [15]:
# tissue specific models
tissue_spec_model 

{'Naive': 'Naive_SpecificModel.mat'}

3. To use a customized model, specify `tissue_spec_model` manually, e.g., by uncommenting `tissue_spec_model` in the following cell.

In [16]:
# Manually specify up- and down-regulated genes for the disease. (Please upload manually created files to `/pipelines/data/`. Use the filenames given below or change them accordingly.)
# Disease_Down = 'Disease_DOWN.txt'
# Disease_Up = 'Disease_UP.txt'
# drug_raw_file = 'Repurposing_Hub_export.txt'

# Manually specify tissue-specific models fine-tuned by the user. Change the file names accordingly. Users can supply a single model or multiple models here; simulation time increases with the number of models.
# tissue_spec_model = {'Th1':'Th1Model.mat',
#                      'Th2':'Th2Model.mat',
#                      'Th17':'Th17Model.mat',
#                      'Naive':'NaiveModel.mat'}

# Manually specify tissue-specific models created with the MATLAB COBRA Toolbox. For the example run, we have provided four models of CD4+ T cells (naive, Th1, Th2, and Th17). Uncomment all of them or any specific model.
# tissue_spec_model = {'Th1':'Th1_SpecificModel_matlab.mat',
#                      'Th2':'Th2_SpecificModel_matlab.mat',
#                      'Th17':'Th17_SpecificModel_matlab.mat',
#                      'Naive':'Naive_SpecificModel_matlab.mat'}
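
The optional filtering described in instruction 1 (removing activators, agonists, and withdrawn drugs from the downloaded export) can be scripted instead of done by hand. This is a minimal sketch under stated assumptions: the file is tab-delimited, and the column names `Name`, `MOA`, and `Phase` are guesses that should be checked against the header of your own export. `keep_drug` and `filter_drugs` are hypothetical helpers, and the sample rows are invented for illustration.

```python
import csv
import io
import re

# Mechanism-of-action terms that mark a drug for removal (from instruction 1).
EXCLUDED_MOA_TERMS = {"activator", "agonist"}


def keep_drug(row, moa_col="MOA", phase_col="Phase"):
    """Return True unless the row is an activator, agonist, or withdrawn drug.

    The column names are assumptions; adjust them to match your export.
    """
    # Match whole words so that "antagonist" is not mistaken for "agonist".
    moa_words = re.findall(r"[a-z]+", (row.get(moa_col) or "").lower())
    if any(word in EXCLUDED_MOA_TERMS for word in moa_words):
        return False
    return (row.get(phase_col) or "").strip().lower() != "withdrawn"


def filter_drugs(text, delimiter="\t"):
    """Parse delimited text and return only the rows that pass keep_drug."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader if keep_drug(row)]


# Invented sample rows, used only to exercise the filter.
sample = (
    "Name\tMOA\tPhase\n"
    "drug_a\tcyclooxygenase inhibitor\tLaunched\n"
    "drug_b\tadrenergic receptor agonist\tLaunched\n"
    "drug_c\tadrenergic receptor antagonist\tLaunched\n"
    "drug_d\tHDAC inhibitor\tWithdrawn\n"
    "drug_e\tAMPK activator\tPhase 2\n"
)
kept = [row["Name"] for row in filter_drugs(sample)]
print(kept)
```

Whole-word matching matters here because a plain substring test for "agonist" would also discard every antagonist. Compound labels such as "inverse agonist" or "partial agonist" are removed by this sketch; decide whether that matches your intent before using it on real data.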


In [17]:
# Knock out simulation for the analyzed tissues and diseases
diseases = ['lupus', 'arthritis']
for key,value in tissue_spec_model.items():
    for dis in diseases:
        # load the results of step 3 to dictionary 'disease_files'
        step3_results_file = os.path.join(configs.datadir, 'results', key, 
                                          dis, 'step2_results_files.json')
        with open(step3_results_file) as json_file:
            disease_files = json.load(json_file)
        #print(disease_files)
        Disease_Down = disease_files['DN_Reg']
        Disease_Up = disease_files['UP_Reg']
        drug_raw_file = 'Repurposing_Hub_export.txt'
        
        out_dir = os.path.join(configs.datadir, "results", key, dis)
        tissueSpecificModelfile  = os.path.join(configs.datadir, "results", key, value)
        print(tissueSpecificModelfile)
        tissue_gene_folder = os.path.join(configs.datadir, key)
        os.makedirs(tissue_gene_folder, exist_ok=True)
        inhibitors_file = '{}_inhibitors_Entrez.txt'.format(key)
        cmd = ' '.join(['python3' , 'knock_out_simulation.py',
                      '-t', tissueSpecificModelfile,
                      '-i', inhibitors_file,
                      '-u', Disease_Up,
                      '-d', Disease_Down,
                      '-f', out_dir,
                      '-r', drug_raw_file])
        !{cmd}

        # copy generated output to output folder
        cmd = ' '.join(['cp', '-a', os.path.join(configs.datadir, key), configs.outputdir])
        !{cmd}
        #break


FileNotFoundError: [Errno 2] No such file or directory: '/home/jupyteruser/work/data/results/Naive/lupus/step2_results_files.json'
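
The `FileNotFoundError` above is raised when no results file exists for a tissue/disease pair, for example because the differential expression step did not finish for that pair. One way to keep the loop running is to skip such pairs. The sketch below is an illustration rather than part of the pipeline: `load_disease_files` is a hypothetical helper, a temporary directory stands in for `configs.datadir`, and `down.txt`/`up.txt` are placeholder values.

```python
import json
import os
import tempfile


def load_disease_files(data_dir, tissue, disease,
                       filename="step2_results_files.json"):
    """Return the parsed results JSON, or None if the file does not exist."""
    path = os.path.join(data_dir, "results", tissue, disease, filename)
    if not os.path.isfile(path):
        print(f"Skipping {tissue}/{disease}: {path} not found")
        return None
    with open(path) as json_file:
        return json.load(json_file)


# Demonstration: only the lupus results file exists, arthritis is missing.
with tempfile.TemporaryDirectory() as data_dir:
    lupus_dir = os.path.join(data_dir, "results", "Naive", "lupus")
    os.makedirs(lupus_dir)
    with open(os.path.join(lupus_dir, "step2_results_files.json"), "w") as fh:
        json.dump({"DN_Reg": "down.txt", "UP_Reg": "up.txt"}, fh)

    results = {
        dis: load_disease_files(data_dir, "Naive", dis)
        for dis in ["lupus", "arthritis"]
    }

print(results["lupus"])
print(results["arthritis"])
```

In the knock-out loop, a `None` return value would be the signal to `continue` to the next disease instead of calling `knock_out_simulation.py`. Skipping silently can hide a failed earlier step, so the printed message is kept deliberately.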