# **MADRID - MetAbolic Drug Repurposing and IDentification** 

## **Who should use this pipeline?**

This Jupyterlab pipeline has everything necessary to build a context-specific constraint-based metabolic model from any single source or combination of sources from the following '-omics' data.
- Bulk RNA-seq
- Single-cell RNA-seq
- Proteomics (specific name???)
- Microarray


It also serves as a platform to use these models to identify drug targets and potentially repurposable drugs for metabolism-impacting diseases.

Using MADRID does not require or expect any amount of programming experience to create models. However every step of the pipeline is packaged in its own .py file to promote accessible modification, addition, or replacement of analysis steps. The Jupyterlab container comes pre-loaded with the most popular R and Python libraries, however, if you would like to use a library and cannot install it for any reason, please request it on our Github page!<br>
https://github.com/HelikarLab/MADRID

***Warning! If you terminate your session after running the Docker, any changes you make WILL NOT BE SAVED! Please mount a local directory to the Docker image as instructed on the Github and Dockerhub README's to prevent data loss.**


## **Before You Use**

### **Before running, you must provide the proper files depending on the types of data you will use for model creation.**
- For RNA-seq: Either a correctly formatted folder named "MADRID_input" in the the data directory. Proper inputs can be generated using our Snakemake pipeline specifically designed for MADRID: https://github.com/HelikarLab/FastqToGeneCounts.  RNA-seq data can either be single-cell or bulk, but our Snakemake pipeline provided is for bulk only at the time. If processing RNA-seq data with an alternative procedure, or importing a pre-made gene count matrix, follow the instructions in Step 1.

- For proteomics. A matrix of measurement ****NEED TO DISCUSS WITH BHANWAR WHAT THESE VAULES MEAN***** where rows are proteins in Entrez format and columns are arbitrary sample names.

- For microarray, results must be uploaded to Gene Expression Omnibus, the only thing to provide in MADRID is a configuration file with GSE, GSM, and GPL codes. microarray_data_inputs.xlsx shows a template to use. Note that microarray has become obsolete and it is recomended to use RNA-seq if possible.

### **Six steps to identifying drug targets, stop after step 3 if building a context-specific model for other purposes**
        
1. Preprocess Bulk RNAseq data by converting STAR outputed gene counts files into a unified matrix, fetch necessary info about each gene required for normalization, and generate a configuration sheet.

2. Analyze any combination of microarray, RNAseq (total, polyA, single-cell), and proteomics data, output a list of active genes for each strategy.

3. Check for consensus amongst strategies according to desired rigor and merge into a singular set of active genes

4. Create tissue specific models based on the list of active genes. If required the user can manually refine these models and supply them in Step 4. 

5. Identify differential gene expressions from disease datasets using either microarray or bulk RNAseq transcriptomics information.

6. Identify drug targets and repruposable drugs. This step consists of four substeps. 
    - mapping drugs on automatically created or user-supplied models
    - knock-out simulation
    - compare simulation results of perturbed and unperturbed models
    - integrate with disease genes and score drug targets.

### **Configuration sheet information**
The user should upload config .xlsx files to `/work/data/config_sheets`. The sheet names in these config files should correspond to the context (tissue name, cell name, control, etc.) where each sheet contains the sample names to include in that context-specifc model. These sample names should correspond to the sample (column) names in the source data matrix which should be uploaded (or outputed) in `/work/data/data_matrices/<model name>/`
    
In the original Docker image, some exemplary input files are included to build metabolic models of naive, Th1, Th2, and Th17 subtypes and identify drug targets for rheumatoid arthritis. User can follow the documentation and the format of the exemplary input files and and use the provided template files to create your own input files.

## **Step 1: Initialize and Preprocess RNA-seq data**
**Skip if not using RNA-seq)**

RNA-seq data is read by MADRID as count matrix where each column is a different sample/replicate named 'exampleTissueName_SXRYrZ' where:
- X is the study (or batch) number
- Y is the replicate number
- Z is the run number. If the replicate does not contain multiple runs for a single replicate, then "rZ" should be neglected.
- exampleTissueName is the name of the model that will be built from this data. It should be consistant with other data sources if they are to be integrated. **Note that this identifier should not have any special characters including "\_" since it may interfere with parsing.**
<br>

Replicates should come from the same study/batch group and different study/batch numbers can come from different published studies (or batches) as long as the tissue/cell was under similar enough conditions for your modeling purpose. Run numbers in the same replicate will be summed together.
<br>
<br>
#### **Example:**
Say S1 represents a study or batch, and S2 represents a different study or batch of RNA-seq data from m0 macrophages whose model we will name m0Macro. The studies were conducted in a different lab, by different researchers at different times using a different library preparation kit. In each study there are two replicates (R1, and R2) obtained for each one. m0Macro_S1R1 and m0Macro_S1R2 will be checked for consensus to generate a list of active genes in both replicates. These active genes will then be checked for consensus with the consensus of m0Macro_S2R1 and m0Macro_S2R2 to output a list of active genes in both studies.
<br><br>
The reason this system is used is not only to help keep you and MADRID organized. Most types of normalized gene counts are not good for direct comparisons across replicates, and are especially not suitable for comparisons across different experiments. Therefore, MADRID will convert normalized gene counts into a boolean list of active genes. These lists will be compared at the the level of replicates in a batch (or study) and then again at the level of all provided batches (or study). Finally, the active genes will be merged with the outputs of proteomics, microarray and differnt RNA-seq stategies if provided. The rigor used at each level can be easily modified by the user.
<br>

#### **Two ways to initialize RNA-seq data**

1. Import a properly formatted `MADRID_inputs` folder in the `data` directory. 
<br>

> **It is recommended that you use our Snakemake pipeline to align and create a properly formated MADRID_inputs folder.** The pipeline also runs a series of important quality control methods to help determine if any of the provided samples are not suitable for model creation. This pipeline can be found at https://github.com/HelikarLab/FastqToGeneCounts.

> Or, **if using your own alignment protocol**, follow this guide to create a `MADRID_inputs` folder.

> - The top level of the directory has seperate tissues/cells to create seperate models from. The next level must have a folder called `geneCounts` and optionally a `strandedness` folder. If using zFPKM normalization, there should also be two more folders called `layouts` and `fragmentSizes`. Inside each of these folders should be folders named SX (wherer X is an arbitrary user-defined number that replicates are associated with. 
<br>    

> - Inside these study number folders of `geneCounts` should be outputs of STAR aligner with using --quantMode GeneCounts. To help MADRID (and you!) stay organized, these outputs should be renamed `exampleTissueName_SXRYrZ.tab` where X is the study (or batch) number, Y is the replicate number, and Z is the run number. If the replicate does not contain multiple runs the rZ should be neglected. Replicates should come from the same study/sample group and different samples can come from different published studies (or batches) as long as the tissue/cell was under similar enough conditions for your modeling purpose.
<br>

> - Inside the study number folders of `strandedness` should be files named `exampleTissue_SXRYrZ_strandedness.txt`. These files must tell the strandedness of the the RNA-seq method used, and must contain one of the following text (and nothing else):
>> * NONE
>> * FIRST_READ_TRANSCRIPTION_STRAND
>> * SECOND_READ_TRANSCRIPTION_STRAND
<br>

> - Inside the study number folders of `layouts` should be files named `exampleTissue_SXRYrZ_layout.txt` where each file tells the layout of the library used, and must contain one of the following text (and nothing else):
>> * paired-end
>> * single-end
<br>

> - Inside the study number folders of `fragmentSizes` should be files named `exampleTissue_SXRYrZ_fragment_sizes.txt` and contain the output of RSeQC's RNA_fragment_size.py function.

> - Inside the study number folders of `prepMethods` should be files named `exampleTissue_SXRYrZ_prep_method.txt` where each file tells the library preparation strategy, which must be one of the following.
>> * total
>> * mRNA
<br><br>
where 'total' refers to Total RNA and mRNA refers to polyA enriched RNA. Note that these strategies serve only to differentiate the methods in the event that both are used to build a model. If a different library strategy is desired, you can use either one as a placeholder, or will very little Python knowledge, it is easy add a new strategy to `merge_xomics.py`.  

2. Import a properly formatted counts matrix in `/work/data/data_matrices/exampleTissue/gene_counts_matrix_exampleTissue.csv` where the rows are name exampleTissue_SXRY (note the lack of run number since runs should be summed into a single set of counts). **If providing the count matrix this way, instead of generating one using method 1, you will also have to create a configuration file** that has each sample's name, study/batch number, and if using zFPKM, layout and mean fragment length. **Use the provided template** to help. Once provided, run rnaseq_preprocess.py with the '--provide-matrix' argument. <br><br> ***This method is best if you are downloading a premade count matrix, or using single-cell data that has already been batch corrected, clustered, and sorted into only the cell type of interest!**

#### **Structure of MADRD_input folder**
├── STAR_output <br />
&emsp;&emsp;&emsp;└── exampleTissue <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── geneCounts <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S1 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S1R4.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S2 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S2R4.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S2R4.tab <br />
&nbsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;&ensp;└── S3 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R3r1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R3r2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R3r3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R4r1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R4r2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R4r3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R5r1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R5r2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R5r3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R6r1.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ├── exampleTissue_S3R6r2.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; └── exampleTissue_S3R6r3.tab <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── strandedness <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S1 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├──  exampleTissue_S1R1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S1R4_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S2 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S2R4_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── S3 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r1_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r2_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;└── exampleTissue_S3R6r3_strandedness.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── layouts <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S1 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S1R4_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S2 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S2R4_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── S3 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r1_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r2_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;└── exampleTissue_S3R6r3_layout.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── prepMethods (Optional, unless using multiple RNA-seq types) <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S1 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R1_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R2_prep_method <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R3_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S1R4_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── S2 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R1_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R2_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R3_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S2R4_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── S3 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R1_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R2_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r1_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r2_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r3_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r1_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r2_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r3_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r1_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r2_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r3_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r1_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r2_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;└── exampleTissue_S3R6r3_prep_method.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;└── fragementSizes (optional, for zFPKM only) <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;├── S1 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S1R3_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S1R4_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;├── S2 <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;├── exampleTissue_S2R3_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;│&emsp;&emsp;&emsp;└── exampleTissue_S2R4_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;└── S3  <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R3r3_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R4r3_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R5r3_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r1_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;├── exampleTissue_S3R6r2_fragment_size.txt <br />
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;└── exampleTissue_S3R6r3_fragment_size.txt <br />

Currently, MADRID can filter raw RNA-seq counts using three normalization techniques.

> - TPM Quantile, where each replicate is normalized with Transcripts-per-million and an upper quantile is taken to create a boolean list of active genes for the replicate. Replicates are compared for consensus within the study/batch number according to user-defined ratios and then study/batch numbers are checked for consensus according to different user defined ratios. **Recomended if user wants more control over the size of the model, like a smaller model that allows for only the most expressed reactions, or a larger more encompassing one that contains less essential reactions.

> - zFPKM method outlined in: https://pubmed.ncbi.nlm.nih.gov/24215113/ can be used. Counts will be normalized using zFPKM and genes > -3 will be considered expressed per thier recommendation. Expressed genes will be checked for consensus at the replicate and study/batch levels the same as TPM Quantile. **Recommended if user wants to give least input over gene essentially determination and use the most standardized method of active gene determination. 

> - flat cutoff of CPM (counts per million) normalized values, check for consensus the same as other methods. **Not recommended**

Regardless of normalization technique used, or provided files used for RNA-seq, preprocessing is required to fetch relevent gene information needed for harmonization and normalization such as Entrez ID, and the start and end postions.


In [113]:
# import necessary python packages
import sys
import os
import pandas
import numpy
import json
import re
from subprocess import call
from project import configs
import bioservices
import pprint

In [None]:
# Step 1: Preprocess RNAseq data by generating a counts matrix, config sheet, and gene file from MADRID_inputs.

tissue_names = "['m0Macro', 'm1Macro', 'm2Macro']" # tissue name or list of tissue names within a string
create_counts_matrix = True # set to false if using a pregenerated matrix file
gene_format = "Ensembl"     # accepts 'Entrez', 'Ensembl', and 'Symbol'
taxon_id = "human" # accepts integer (bioDBnet taxon id) or 'human' or 'mouse'
preprocess_mode = "create-matrix" # "create-matrix" or "provide-matrix"
    
cmd = ' '.join(['python3', 'rnaseq_preprocess.py',
                '--tissue-name', '"{}"'.format(tissue_names),
                '--gene-format', '"{}"'.format(gene_format),
                '--taxon-id', '"{}"'.format(taxon_id),
                '--{}'.format(preprocess_mode)])
!{cmd}

## Step 2: Identifying Gene Activity in Transcriptomics and Proteomics Datasets

Identify gene activity in the following data sources 
 - RNA-seq (bulk or single-cell)
 - Proteomics
 - Microarry

 Only one sources is required for model generation, multiple can be helpful for additional validation if of high-quality. 

### Microarrays

From wikipedia: A microarray is a multiplex lab-on-a-chip. Its purpose is to simultaneously detect the expression of thousands of genes from a sample (e.g. from a tissue). It is a two-dimensional array on a solid substrate—usually a glass slide or silicon thin-film cell—that assays (tests) large amounts of biological material using high-throughput screening miniaturized, multiplexed and parallel processing and detection methods.

MADRID can directly download and analyze microarray data from GEO for Agilent and Affymetrix platoforms. Follow the template and example in `/work/data/config_sheets/` to use microarray data in your analysis.

Microarray technology is becoming increasingly obsolete, if possible it is recommended that you use RNA-seq instead. Although different strategies exist for microarrays, MADRID does not distinguish between them nor are  there any plans to in the future due to it's obsoelecense.

In [145]:
# Step 2.1 Download and Analyze microarray data

# config file for microarray
microarray_config_file = 'microarray_data_inputs.xlsx'

# execute
cmd = ' '.join(['python3', 'microarray_gen.py', 
      '-i', '"{}"'.format(microarray_config_file),
      '-e', '"{}"'.format(expression_proportion),
      '-t', '"{}"'.format(top_proportion)])
!{cmd}

Input file is  microarray_data_inputs.xlsx
Expression Proportion for Gene Expression is  0.5
Top proportion for high-confidence genes is  0.9
---
Start Collecting Data for:
['GSE22886' 'GSE43005' 'GSE22045' 'GSE24634']
['GSM565273' 'GSM565274' 'GSM565275' 'GSM565290' 'GSM565291' 'GSM565292'
 'GSM1054773' 'GSM1054779' 'GSM1054781' 'GSM1054789' 'GSM548000'
 'GSM548001' 'GSM607510' 'GSM607511' 'GSM607512']
---

Initialize project (GSE22886):
Root: /home/jupyteruser/work
Raw data: /home/jupyteruser/work/data/GSE22886_RAW
Retrieve Sample: /home/jupyteruser/work/data/GSE22886_RAW/GSM565273.tar
Extract to: /home/jupyteruser/work/data/GSE22886_RAW/GPL96
Retrieve Sample: /home/jupyteruser/work/data/GSE22886_RAW/GSM565274.tar
Extract to: /home/jupyteruser/work/data/GSE22886_RAW/GPL96
Retrieve Sample: /home/jupyteruser/work/data/GSE22886_RAW/GSM565275.tar
Extract to: /home/jupyteruser/work/data/GSE22886_RAW/GPL96
Retrieve Sample: /home/jupyteruser/work/data/GSE22886_RAW/GSM565290.tar
Extract to: 

### **RNA-seq Analysis**

RNA-seq analysis has two primary types, bulk tissue, and single-cell. Bulk RNA-seq also has multiple strategies of library preparation. If using public data, the user may run into a situation where they wish to use a combination of bulk RNA-seq data produced using two very different library preparation strategies. 
  

#### **Bulk RNA-seq**

MADRID currently supports the two most common strategies, mRNA (polyA) enriched RNA-seq, and total RNA-seq. 

Because of expected differences in distribution of transcripts, MADRID is written to handle each strategy seperately before the integration step. The recommended Snakemake alignment pipeline is designed to work with MADRID's preprocessing step (Step 1) to split RNA-seq data from GEO into seperate input matrices and config sheets.

**To create a gene expression file for total RNA-seq data, use the "total" for the '--library-prep' argument**

**To create a gene expression file for mRNA enriched / polA RNA-seq data, use the "mRNA" for the '--library-prep' argument.**

The analysis of each strategy is identical so specifying the type only serves to ensure MADRID analyzes them seperately.

#### **Single-cell RNA-seq**

While the Snakemake pipeline does not yet support single-cell alignment and MADRID does not yet support automated configuration file and counts-matrix file creation for single-cell alignment output from STAR, it is possible to use single-cell data to create a model with MADRID. 

Since normalization strategies can be applied to single-cell the same way it is applied to bulk, rnaseq_gen.py can be used with a provided counts matrix and config sheet (see Step 1 to help create it). 

Just like 'total' and 'mRNA', rnaseq_gen.py can be run with "SC" as the '--library-prep' argument to help MADRID differentiate it from the bulk RNA-seq data if using multiple strategies.   

In [61]:
# step 2.2 RNA-seq Analysis for Total RNA-seq library preparation

# config for total rna-seq
trnaseq_config_file = 'trnaseq_data_inputs_auto.xlsx'

rep_ratio = 0.5          # proportion of replicates for a gene to be active in a sample
group_ratio = 0.9        # proportion of samples with expression required for gene  
rep_ratio_h = 0.5        # proportion of replicates with expression required for high-confidence
group_ratio_h = 0.9      # proportion of replicates with expression required for high-confidence
technique = "quantile"   # quantile-tpm, cpm, or zfpkm
quantile = 50            # cutoff TPM percentile for quantile filtering
prep_method = "total"    # library prepartion method ('total', mRNA', or 'SC')

cmd = ' '.join(['python3', 'rnaseq_gen.py', 
      '--config-file', '"{}"'.format(trnaseq_config_file), 
      '--replicate-ratio', '"{}"'.format(rep_ratio),   
      '--group-ratio', '"{}"'.format(group_ratio),        
      '--high-replicate-ratio', '"{}"'.format(rep_ratio_h),    
      '--high-group-ratio', '"{}"'.format(group_ratio_h),   
      '--filt-technique', '"{}"'.format(technique),        
      '--quantile', '"{}"'.format(quantile),
      '--library-prep', '"{}"'.format(prep_method)])       
                
!{cmd}

[0;1;31mSystem has not been booted with systemd as init system (PID 1). Can't operate.[0m
[0;1;31mFailed to create bus connection: Host is down[0m
Config file is "trnaseq_data_inputs_auto.xlsx"
model:  m0Macro
Input count matrix is at "/home/jupyteruser/work/data/data_matrices/m0Macro/gene_counts_matrix_total_m0Macro.csv"
Gene info file is at "/home/jupyteruser/work/data/results/m0Macro/gene_info_m0Macro.csv"
[1] "Reading Counts Matrix"
[1] "/home/jupyteruser/work/data/data_matrices/m0Macro/gene_counts_matrix_total_m0Macro.csv"
[1] "m0Macro"
[1] "Filtering Counts"
Test data saved to /home/jupyteruser/work/data/results/m0Macro/rnaseq_total_m0Macro.csv
model:  m1Macro
Input count matrix is at "/home/jupyteruser/work/data/data_matrices/m1Macro/gene_counts_matrix_total_m1Macro.csv"
Gene info file is at "/home/jupyteruser/work/data/results/m1Macro/gene_info_m1Macro.csv"
[1] "Reading Counts Matrix"
[1] "/home/jupyteruser/work/data/data_matrices/m1Macro/gene_counts_matrix_total_m1Macro.cs

In [62]:
# step 2.3 mRNA capture (polyA) RNA-seq Analysis

# config for mRNA (polyA) enriched RNA-seq 
mrnaseq_config_file = 'mrnaseq_data_inputs_auto.xlsx'

rep_ratio = 0.5          # proportion of replicates for a gene to be active in a sample
group_ratio = 0.9        # proportion of samples with expression required for gene  
rep_ratio_h = 0.5        # proportion of replicates with expression required for high-confidence
group_ratio_h = 0.9      # proportion of replicates with expression required for high-confidence
technique = "quantile"   # quantile-tpm, cpm, or zfpkm
quantile = 50           # cutoff TPM percentile for quantile filtering
prep_method = "mrna"

cmd = ' '.join(['python3', 'rnaseq_gen.py', 
      '--config-file', '"{}"'.format(mrnaseq_config_file), 
      '--replicate-ratio', '"{}"'.format(rep_ratio),   
      '--group-ratio', '"{}"'.format(group_ratio),        
      '--high-replicate-ratio', '"{}"'.format(rep_ratio_h),    
      '--high-group-ratio', '"{}"'.format(group_ratio_h),   
      '--filt-technique', '"{}"'.format(technique),        
      '--quantile', '"{}"'.format(quantile),
      '--library-prep', '"{}"'.format(prep_method)])       
                
!{cmd}

[0;1;31mSystem has not been booted with systemd as init system (PID 1). Can't operate.[0m
[0;1;31mFailed to create bus connection: Host is down[0m
Config file is "mrnaseq_data_inputs_auto.xlsx"
model:  m0Macro
Input count matrix is at "/home/jupyteruser/work/data/data_matrices/m0Macro/gene_counts_matrix_mrna_m0Macro.csv"
Gene info file is at "/home/jupyteruser/work/data/results/m0Macro/gene_info_m0Macro.csv"
[1] "Reading Counts Matrix"
[1] "/home/jupyteruser/work/data/data_matrices/m0Macro/gene_counts_matrix_mrna_m0Macro.csv"
[1] "m0Macro"
[1] "Filtering Counts"
Test data saved to /home/jupyteruser/work/data/results/m0Macro/rnaseq_mrna_m0Macro.csv


In [6]:
# Step 2.4 Proteomics Analysis

# config file for proteomics
proteomics_config_file = 'proteomics_data_inputs.xlsx'

# ratio of replicates required for a gene to be considered active in that sample
active_ratio = 0.5

# Genes can be considered high confidence if they are expressed in a high proportion of samples.
# High confidence genes will be considered expressed regardless of agreement with other data sources
high_ratio = 0.9

quantile = 25

cmd = ' '.join(['python3', 'proteomics_gen.py', 
      '-c', '"{}"'.format(proteomics_config_file),
      '-e', '"{}"'.format(active_ratio),
      '-t', '"{}"'.format(high_ratio),
      '-p', '"{}"'.format(quantile)])
!{cmd}

Config file is at "/home/jupyteruser/work/data/config_sheets/proteomics_data_inputs.xlsx"
Data matrix is at "/home/jupyteruser/work/data/data_matrices/Naive/ProteomicsDataMatrix_Naive.csv"
Test Data Saved to /home/jupyteruser/work/data/results/Naive/Proteomics_Naive.csv


In [102]:
tname = "test_name"
d = dict({tname: ["lala", "adfaf", "asdfsadf"]})
if list(d.keys())[0] == "test_name":
    print("true")
    
tname = "test_name"
d = (tname, ["lala", "adfaf", "asdfsadf"])
if d[0] == "test_name":
    print("true")



true
true


## Step 3: Merge Expression from Different Data Sources

So far, active genes have been determined for at least one data source. If using multiple sources of any combination of microarray, bulk RNA-seq of either total RNA or mRNA capture (polyA), proteomics, or single-cell RNA-seq. Now we can merge the active genes from each data source to make a list of active genes that is more comprehenisve or more strict than any individual list. 

`merge_xomics.py` takes each used data source as an argument, and it is easy to add new ones to the script if desired. The other arguments to consider are:
- **--expression_requirement** which is the number of data souces with expression required for a gene to be considered active if not a high-confidence gene for any source (defaults to the total number of input data sources arguments provided)

- **--requirement_adjust** is the method to adjust the required when available sources differ from input arguments for a tissue (*Note that this does nothing if there is only one tissue type in the config files). 

> - "progressive" - expression requirement applies to tissue(s) with lowest number of data source types. Tissues with more will require 1 more source to have an active gene per additional source provided for the gene to be active in the model
        
> - "regressive" - expression requirement applies to the tissue(s) with largest number of data source types. Tissues with less will require 1 less source to have an active gene per missing source for the gene to be active in the model
                    
> - "flat" - expression requirement is used regardless of differences in number of data sources provided for different tissues

> - "custom" - (Not yet implemented!) an .xlsx file where column one is the tissue type, and column two is the expression requirement. Requires additional argument '--requirements-file' whose value is the filename of this .xlsx file in `/work/data/`
                            
> *Adjusted expression requirement will never resolve to be < 1

- **--no-hc** use this flag to prevent high-confidence gene from overiding expression_requirement (Not yet implemented). 

If the '--no-hc' flag is not used, any gene that was determined to be high-confidence in any input data source will cause the gene to be active in the final model, regardless of agreement with other sources.**
<br />

If a gene is NA, meaning it was tested for in the library of that data source, it will subtract one from the expression requirement as long as this will not make it < 1

In [121]:
# Step 2.4 Merge the gene lists of data sources, create a list of active gene IDs

expression_requirement = 2
        
requirement_adjust = "progressive" 


cmd = ' '.join(['python3', 'merge_xomics.py', 
      #'--microarray-config-file', '"{}"'.format(microarray_config_file),
      '--total-rnaseq-config-file', '"{}"'.format(trnaseq_config_file),
      '--mrnaseq-config-file', '"{}"'.format(mrnaseq_config_file),
      #'--proteomics-config-file', '"{}"'.format(proteomics_config_file),
      '--expression-requirement', '"{}"'.format(expression_requirement),
      '--requirement-adjust', '"{}"'.format(requirement_adjust)])
!{cmd}

[0;1;31mSystem has not been booted with systemd as init system (PID 1). Can't operate.[0m
[0;1;31mFailed to create bus connection: Host is down[0m
trnaseq_data_inputs_auto.xlsx
mrnaseq_data_inputs_auto.xlsx
Will merge data for: ['m0Macro', 'm1Macro', 'm2Macro']
progressive:  2
Merging data for m0Macro
Read from /home/jupyteruser/work/data/results/m0Macro/rnaseq_total_m0Macro.csv
Read from /home/jupyteruser/work/data/results/m0Macro/rnaseq_mrna_m0Macro.csv
dummy
dummy
m0Macro
m0Macro
dummy
  df_results = df_results.append(dup_rows,ignore_index=True)
  df_results = df_results.append(dup_rows,ignore_index=True)
  df_results = df_results.append(dup_rows,ignore_index=True)
  df_results = df_results.append(dup_rows,ignore_index=True)
  df_results = df_results.append(dup_rows,ignore_index=True)
  df_results = df_results.append(dup_rows,ignore_index=True)
  df_results = df_results.append(dup_rows,ignore_index=True)
  df_results = df_results.append(dup_rows,ignore_index=True)
  df_results =

## Step 4: Create Tissue/Cell-Type Specific Models

In [253]:
# Load the output of step 1, which is a dictionary that specifies the merged list of active Gene IDs for each tissue

step1_results_file = os.path.join(configs.rootdir, 'data', 'results', 'step1_results_files.json')
with open(step1_results_file) as json_file:
    tissue_gene_exp = json.load(json_file)
print(tissue_gene_exp)

{'m0Macro': '/home/jupyteruser/work/data/results/m0Macro/GeneExpression_m0Macro_Merged.csv', 'm1Macro': '/home/jupyteruser/work/data/results/m1Macro/GeneExpression_m1Macro_Merged.csv', 'm2Macro': '/home/jupyteruser/work/data/results/m2Macro/GeneExpression_m2Macro_Merged.csv'}


*** Specify input files for step 2 here ***

In [None]:
# create tissue specific model, the names of output files are stored in dictionary tissue_spec_model

general_model_file = "GeneralModel.mat"
#exclude_rxns_file = os.path.join(configs.datadir, "inconsistant_rxns.csv") # flux inconsistant rxns to remove from core reactions in fastcore
#force_rxns_file = os.path.join(configs.datadir, "lit_core_rxns.csv") 
#boundary_rxns_file = os.path.join(configs.datadir, "exchange_rxns,csv")
recon_algorithm = 'FASTCORE' # troppo reconstruction algorithm to use
solver = "GUROBI"
objective = 'biomass_reaction_Mphage'
#objective = 'BIOMASS_reaction'

for key,value in tissue_gene_exp.items():
    tissuefile = '{}_SpecificModel.mat'.format(key) # key is == tissue name
    tissue_gene_file = re.split('/|\\\\', value)[-1]
    #tissue_gene_folder = os.path.join(configs.rootdir, 'data', key)
    #os.makedirs(tissue_gene_folder, exist_ok=True)
    cmd = ' '.join(['python3', 'create_tissue_specific_model.py', 
                      '--tissue-name', '"{}"'.format(key),
                      '--reference-model-file', '"{}"'.format(general_model_file), 
                      '--gene-expression-file', '"{}"'.format(tissue_gene_file),
                      '--objective', '"{}"'.format(objective),
                      #'--boundary-reactions-file', '"{}"'.format(boundary_rxns_file),
                      #'--exclude-reactions-file', '"{}"'.format(exclude_rxns_file),
                      #'--force-reactions-file', '"{}"'.format(force_rxns_file),
                      '--algorithm', '"{}"'.format(recon_algorithm),
                      '--solver', '"{}"'.format(solver)])
    !{cmd}

Tissue Name is "m0Macro"
General Model file is "GeneralModel.mat"
Gene Expression file is "GeneExpression_m0Macro_Merged.csv"
Objective function is "biomass_reaction_Mphage".
Constructing model with "FASTCORE" reconstruction algorithm using "GUROBI" solver
Set parameter WLSAccessID
Set parameter WLSSecret
Set parameter LicenseID
Academic license - for non-commercial use only - registered to bbessell2@unl.edu


## Step 5: Identifying disease related genes by analyzing transcriptomics data of patients
Differential Expression Analysis

In the config_sheets folder, there should be a folder called "disease". You can add a spreadsheet for each cell/tissue type called `disease_data_inputs_<tissue_name>`. Each sheet of this file should correspond to a seperate disease to analyze using DGE nfor that tissue. The source data can be either microarray or bulk RNA-seq and is formatted the same as if creating the base tissue model. The sheet names should contain the disease name, an underscore, and than either "microarray" or "bulk" depending on the source data. For example, if the disease is lupus, and the source data is bulk RNA-seq, the name of the sheet should be "lupus_bulk". This can be seen in the example sheet. If using bulk RNA-seq data, there should be a count matrix file in `/work/data/data_matrices/<tissue_name>/disease/` called `BulkRNAseqDataMatrix_<disease name>_<tissue name>`. 

*** Specify input files for step 3 here ***

In [13]:
# specify tissue names to perform a disease analysis on. The diseases to analyze should be
# specified in `/work/data/config_sheets/disease/diease_data_inputs_<tissue name>`
tissue_names = ['Naive']

In [14]:
# Differential gene expression analysis
for tissue_name in tissue_names:
    disease_config_file = "".join(["disease_data_inputs_", tissue_name, ".xlsx"])
    cmd = ' '.join(['python3', 'disease_analysis.py',
                  '-t', '"{}"'.format(tissue_name),
                  '-c', '"{}"'.format(disease_config_file)])
    !{cmd}

Config file is at  /home/jupyteruser/work/data/config_sheets/disease/disease_data_inputs_Naive.xlsx
Count Matrix File is at  /home/jupyteruser/work/data/data_matrices/Naive/disease/BulkRNAseqDataMatrix_lupus_Naive.csv
[1] "Reading Counts Matrix"
[1] "Performing DGE"
Traceback (most recent call last):
  File "disease_analysis.py", line 186, in <module>
    main(sys.argv[1:])
  File "disease_analysis.py", line 120, in main
    data2 = DGEio.DGE_main(count_matrix_path, inqueryFullPath, tissue_name, disease_name)
  File "/usr/local/lib/python3.8/dist-packages/rpy2/robjects/functions.py", line 198, in __call__
    return (super(SignatureTranslatedFunction, self)
  File "/usr/local/lib/python3.8/dist-packages/rpy2/robjects/functions.py", line 125, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/rpy2/rinterface_lib/conversion.py", line 45, in _
    cdata = function(*args, **kwargs)
  File "/usr/local/lib/python3.8/di

## Step 6: Identification of drug targets and repurposable drugs
This step maps drug targets in metabolic models,prforms knock out simulation, and compare simulation results with disease genes and identifies drug targets and repurposable drugs

*** Specify input files for step 4 here ***

1. Instruction: A processed Drug-Target file is included in the `/root/pipelines/data/`. (Optional step) For the updated versions the users can download `Repurposing_Hub_export.txt` from [Drug Repurposing Hub](https://clue.io/repurposing-app). From the downloaded file first remove all the activators, agonists, and withdrawn drugs and then upload to to `/root/pipelines/data/`.

2. To use automatically created tissue specific models. Note: It is recommended to use refined and validated models for further analysis. User can define cutomized models in next sub-step.

In [15]:
# tissue specific models
tissue_spec_model 

{'Naive': 'Naive_SpecificModel.mat'}

3. To use customized model, please specify `tissue_spec_model` manually, e.g. uncomment tissue_spec_model in the following cell.

In [16]:
# Manually specify Up and Down Regulated Genes for Disease. (Please upload manually created files `/pipelines/data/`. Use filenames as given belwo or change them accordingly.)
# Disease_Down = 'Disease_DOWN.txt'
# Disease_Up = 'Disease_UP.txt'
# drug_raw_file = 'Repurposing_Hub_export.txt'

# Manually specify tissue specific models fine-tuned by user. Change names of the files accordingly. Users can use single or multiple models here. Using multiple models, simulation time will increase.
# tissue_spec_model = {'Th1':'Th1Model.mat',
#                      'Th2':'Th2Model.mat',
#                      'Th17':'Th17Model.mat',
#                      'Naive':'NaiveModel.mat'}

# Manually specify tissue specific model created by matlab cobratoolbox. For example run, we have provided four models of CD4+ T cells (niave, Th1, Th2, and Th17) please uncomment all or any specific model
# tissue_spec_model = {'Th1':'Th1_SpecificModel_matlab.mat',
#                      'Th2':'Th2_SpecificModel_matlab.mat',
#                      'Th17':'Th17_SpecificModel_matlab.mat',
#                      'Naive':'Naive_SpecificModel_matlab.mat'}


In [17]:
# Knock out simulation for the analyzed tissues and diseases
diseases = ['lupus', 'arthritis']
for key,value in tissue_spec_model.items():
    for dis in diseases:
        # load the results of step 3 to dictionary 'disease_files'
        step3_results_file = os.path.join(configs.datadir, 'results', key, 
                                          dis, 'step2_results_files.json')
        with open(step3_results_file) as json_file:
            disease_files = json.load(json_file)
        #print(disease_files)
        Disease_Down = disease_files['DN_Reg']
        Disease_Up = disease_files['UP_Reg']
        drug_raw_file = 'Repurposing_Hub_export.txt'
        
        out_dir = os.path.join(configs.datadir, "results", key, dis)
        tissueSpecificModelfile  = os.path.join(configs.datadir, "results", key, value)
        print(tissueSpecificModelfile)
        tissue_gene_folder = os.path.join(configs.datadir, key)
        os.makedirs(tissue_gene_folder, exist_ok=True)
        inhibitors_file = '{}_inhibitors_Entrez.txt'.format(key)
        cmd = ' '.join(['python3' , 'knock_out_simulation.py',
                      '-t', tissueSpecificModelfile,
                      '-i', inhibitors_file,
                      '-u', Disease_Up,
                      '-d', Disease_Down,
                      '-f', out_dir,
                      '-r', drug_raw_file])
        !{cmd}

        # copy generated output to output folder
        cmd = ' '.join(['cp', '-a', os.path.join(configs.datadir, key), configs.outputdir])
        !{cmd}
        #break


FileNotFoundError: [Errno 2] No such file or directory: '/home/jupyteruser/work/data/results/Naive/lupus/step2_results_files.json'