## Instructions

This Jupyter notebook runs the MADRID pipeline to identify drug targets and repurposable drugs for user-defined complex human diseases. The process consists of four steps:
1. Download and analyze transcriptomics and proteomics data, and output a list of active genes.
2. Create tissue-specific models based on the list of active genes. If required, the user can manually refine these models and supply them in Step 4.
3. Identify differentially expressed genes from disease datasets.
4. Identify drug targets and repurposable drugs. This step consists of four substeps:
 (i) map drugs onto automatically created or user-supplied models, (ii) run knock-out simulations, (iii) compare simulation results of perturbed and unperturbed models, and (iv) integrate with disease genes and score drug targets.

Users need to create the input files for each step, upload them to `/root/pipelines/data/` in the Docker container, and specify them in this notebook. The original Docker image includes example input files for building metabolic models of naive, Th1, Th2, and Th17 subtypes and for identifying drug targets for rheumatoid arthritis. Follow the documentation and the format of the example input files to create your own input files.
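
Before running the steps, it can help to confirm that the input files were actually uploaded. The following is a minimal sketch, not part of MADRID: `find_missing_inputs` is a hypothetical helper, and the directory and file names simply mirror the defaults that appear in this notebook.

```python
import os


def find_missing_inputs(data_dir, filenames):
    """Return the names in `filenames` that are not regular files in `data_dir`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(data_dir, name))]


# Assumed defaults taken from this notebook; adjust to your own inputs.
data_dir = "/root/pipelines/data/"
required = ["bulk_data_inputs.csv", "proteomics_data_inputs.xlsx"]

missing = find_missing_inputs(data_dir, required)
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All input files found.")
```

Running this check first surfaces a missing upload immediately, rather than as an error partway through a later step.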

In [28]:
# import necessary python packages
import sys
import os
import pandas
import numpy
import json
import re
from subprocess import call
from project import configs

# print root path of the project
print(configs.rootdir) 

G:/GitHub/MADRID/docker/pipelines/


## Step 1: Identifying gene activity by analyzing transcriptomics and proteomics datasets

*** Specify input files for step 1 here ***

If proteomics data is not available, use:

proteomics_data_file = 'dummy_proteomics_data.xlsx'

proteomics_config_file = 'dummy_proteomics_config.xlsx'

In [5]:
# Step 0: Generate count matrix from gene counts files generated from STAR

technique = "quantile" # technique for bulk RNA-seq active gene determination
                        # for count matrix gen, only used to determine whether or not
                        # picard output mean fragment sizes are required.
        
input_dir = "G:/GitHub/New Folder/MADRID_olddev/docker/pipelines/py/data/bulkData/NaiveB/"
output_dir = "G:/GitHub/MADRID/docker/pipelines/data/"
        
cmd = ' '.join(['python', 'generateCountMatrix.py',
                '-i', '"{}"'.format(input_dir),
                '-o', '"{}"'.format(output_dir),
                '-t', '"{}"'.format(technique)])
#print(cmd)
!{cmd}
# Alternatively, the gene count matrix for RNA-seq can be crafted any other way desired
# and this step can be skipped

['generateCountMatrix.py', '-i', 'G:/GitHub/New Folder/MADRID_olddev/docker/pipelines/py/data/bulkData/NaiveB/', '-o', 'G:/GitHub/MADRID/docker/pipelines/data/', '-t', 'quantile']
Input directory is "G:/GitHub/New Folder/MADRID_olddev/docker/pipelines/py/data/bulkData/NaiveB/"
Output file is "G:/GitHub/MADRID/docker/pipelines/data/"
Active gene determination technique is "quantile"


  from pandas.core.index import Index as PandasIndex
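
The cell above builds one shell string and wraps each value in double quotes by hand, which is needed here because the input path contains a space (`New Folder`). As an alternative sketch, the same flags can be passed as an argument list so that no manual quoting is required. The helper names `build_count_matrix_cmd` and `run_step` are hypothetical, and `sys.executable` is used in place of `python` on the assumption that the script should run under the notebook's own interpreter.

```python
import subprocess
import sys


def build_count_matrix_cmd(input_dir, output_dir, technique):
    """Argument list for generateCountMatrix.py, using the flags from the cell above."""
    return [sys.executable, "generateCountMatrix.py",
            "-i", input_dir,
            "-o", output_dir,
            "-t", technique]


def run_step(args):
    """Run a command given as a list; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(args, check=True).returncode


cmd = build_count_matrix_cmd(
    "G:/GitHub/New Folder/MADRID_olddev/docker/pipelines/py/data/bulkData/NaiveB/",
    "G:/GitHub/MADRID/docker/pipelines/data/",
    "quantile",
)
print(cmd)
# run_step(cmd)  # uncomment in a directory that contains generateCountMatrix.py
```

With `check=True`, a failing script raises an exception in the notebook instead of printing a traceback that is easy to overlook.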


In [18]:
# wd for development

# Specific input files for step 1

# config file for transcriptomics (microarray)
transcriptomics_config_file = 'transcriptomics_data_inputs.xlsx'

# data file for bulk rna-seq
bulk_data_file = 'BulkRNAseqDataMatrix.csv'

# config for bulk rna-seq
bulk_config_file = 'bulk_data_inputs.csv'

# data file for proteomics
proteomics_data_file = 'ProteomicsDataMatrix.xlsx' 

# config file for proteomics
proteomics_config_file = 'proteomics_data_inputs.xlsx'

# proportion of replicates required for a gene to be considered active in that sample
expression_proportion = 0.5

# if a gene is in the top nth percentile in any sample, it is considered high confidence
# and is treated as expressed regardless of the results of other methods
top_percentile = 10
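
To make the two thresholds concrete, here is an illustrative sketch of how they could interact. It is an assumption about the logic described in the comments, not the code used by the MADRID scripts. All function names are hypothetical, ties in expression values are ignored, and the expression values are toy numbers.

```python
import math


def meets_proportion(replicate_flags, expression_proportion):
    """True when the fraction of replicates flagged as expressed reaches the threshold."""
    if not replicate_flags:
        return False
    return sum(replicate_flags) / len(replicate_flags) >= expression_proportion


def top_percentile_genes(expression, top_percentile):
    """Genes ranked in the top `top_percentile` percent of one sample (ties ignored)."""
    ranked = sorted(expression, key=expression.get, reverse=True)
    n_top = math.ceil(len(ranked) * top_percentile / 100)
    return set(ranked[:n_top])


def is_expressed(gene, replicate_flags, expression, expression_proportion, top_percentile):
    """High-confidence genes pass outright; others must meet the replicate proportion."""
    return (gene in top_percentile_genes(expression, top_percentile)
            or meets_proportion(replicate_flags, expression_proportion))


# Toy values for illustration only.
toy_expression = {"g1": 5.0, "g2": 50.0, "g3": 1.0, "g4": 20.0}
print(is_expressed("g2", [False, False], toy_expression, 0.5, 25))
```

In this toy case `g2` is flagged in no replicate but is still called expressed, because it is the top-ranked gene in the sample.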

In [3]:
# Step 1.1 Download and analyze transcriptomics
cmd = ' '.join(['python', 'transcriptomic_gen.py', 
      '-i', '"{}"'.format(transcriptomics_config_file)])
!{cmd}

Input file is " transcriptomics_data_inputs.xlsx
G:/GitHub/MADRID/docker/pipelines/data\transcriptomics_data_inputs.xlsx
---
Start Collecting Data for:
['GSE22886' 'GSE43005' 'GSE22045' 'GSE24634']
['GSM565273' 'GSM565274' 'GSM565275' 'GSM565290' 'GSM565291' 'GSM565292'
 'GSM1054773' 'GSM1054779' 'GSM1054781' 'GSM1054789' 'GSM548000'
 'GSM548001' 'GSM607510' 'GSM607511' 'GSM607512']
---

Initialize project (GSE22886):
Root: G:/GitHub/MADRID/docker/pipelines/
Raw data: G:/GitHub/MADRID/docker/pipelines/data\GSE22886_RAW
Retrieve Sample: G:/GitHub/MADRID/docker/pipelines/data\GSE22886_RAW\GSM565273.tar
Extract to: G:/GitHub/MADRID/docker/pipelines/data\GSE22886_RAW\GPL96
Retrieve Sample: G:/GitHub/MADRID/docker/pipelines/data\GSE22886_RAW\GSM565274.tar
Extract to: G:/GitHub/MADRID/docker/pipelines/data\GSE22886_RAW\GPL96
Retrieve Sample: G:/GitHub/MADRID/docker/pipelines/data\GSE22886_RAW\GSM565275.tar
Extract to: G:/GitHub/MADRID/docker/pipelines/data\GSE22886_RAW\GPL96
Retrieve Sample:

  from pandas.core.index import Index as PandasIndex
10-May-2021 13:51:38 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE22nnn/GSE22886/soft/GSE22886_family.soft.gz to ./GSE22886_family.soft.gz
10-May-2021 13:51:38 INFO utils - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE22nnn/GSE22886/soft/GSE22886_family.soft.gz to ./GSE22886_family.soft.gz

10-May-2021 13:51:45 INFO GEOparse - Parsing ./GSE22886_family.soft.gz: 
10-May-2021 13:51:45 DEBUG GEOparse - DATABASE: GeoMiame
10-May-2021 13:51:45 DEBUG GEOparse - SERIES: GSE22886
10-May-2021 13:51:45 DEBUG GEOparse - PLATFORM: GPL96
Traceback (most recent call last):
  File "transcriptomic_gen.py", line 347, in <module>
    main(sys.argv[1:])
  File "transcriptomic_gen.py", line 338, in main
    df_output = queryTest(df)
  File "transcriptomic_gen.py", line 140, in queryTest
    updateTranscriptomicsDB(gseXXX)
  File "transcriptomic_gen.py", line 84, in updateTranscriptomicsDB
    df_clean = gseXXX.get_entrez_



In [9]:
# Step 1.2 Analyze bulk RNA-seq

# Bulk-RNA-seq can handle many more parameters, 
# bulk_data_file, bulk_config_file, gene_format, and species_dataset are required.

gene_format = "ensembl" # gene format in count file for biomart
species_dataset = "human" # species dataset for biomart
exp_prop_rep = 0.5  # proportion of replicates for a gene to be active in a sample
exp_prop_samp = 0.5 # proportion of samples with expression required for gene   
top_percentile = 25 # any replicate with expression in this percentile is expressed, regardless of other sources  
technique = "quantile" # quantile, cpm, or zFPKM
quantile = 90 # only used with quantile

cmd = ' '.join(['python', 'bulk_gen.py', 
      '-f', '"{}"'.format(bulk_data_file),   # bulk rna-seq data sheet (required)
      '-c', '"{}"'.format(bulk_config_file), # config file for bulk RNA-seq (required)
      '-g', '"{}"'.format(gene_format),      # gene format in count file for biomart (required)
      '-d', '"{}"'.format(species_dataset),  # species dataset for biomart (required)
      '-r', '"{}"'.format(exp_prop_rep),     # proportion of replicates for a gene to be active in a sample
      '-s', '"{}"'.format(exp_prop_samp),    # proportion of samples with expression required for gene       
      '-p', '"{}"'.format(top_percentile),   # top percentile 
      '-t', '"{}"'.format(technique),        # technique for filtering and normalization    
      '-q', '"{}"'.format(quantile)])        # cutoff TPM quantile for quantile technique
                
!{cmd}

Data file is "BulkRNAseqDataMatrix.csv"
Supplementary Data file is "bulk_data_inputs.csv"
G:/GitHub/MADRID/docker/pipelines/data\Bulk_Naive.csv
Test data saved to G:/GitHub/MADRID/docker/pipelines/data\Bulk_Naive.csv


  from pandas.core.index import Index as PandasIndex
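
The parameters above combine a per-replicate cutoff with two proportion thresholds. The sketch below shows one possible reading of the `quantile` technique. It assumes that the cutoff is the nearest-rank value at the given quantile and that genes at or above it count as expressed. The actual rule in `bulk_gen.py` may differ, and all function names here are hypothetical.

```python
import math


def quantile_cutoff(values, quantile):
    """Nearest-rank value at the given percentile (0 < quantile <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(quantile * len(ordered) / 100))
    return ordered[rank - 1]


def expressed_in_replicate(tpm_by_gene, quantile):
    """Genes whose TPM is at or above the replicate's quantile cutoff."""
    cutoff = quantile_cutoff(tpm_by_gene.values(), quantile)
    return {gene for gene, tpm in tpm_by_gene.items() if tpm >= cutoff}


def gene_is_active(samples, gene, quantile, exp_prop_rep, exp_prop_samp):
    """`samples` is a list of samples; each sample is a list of {gene: TPM} replicates."""
    active_samples = 0
    for replicates in samples:
        hits = sum(gene in expressed_in_replicate(rep, quantile) for rep in replicates)
        if hits / len(replicates) >= exp_prop_rep:
            active_samples += 1
    return active_samples / len(samples) >= exp_prop_samp


# Toy replicate with TPM values 1..10, for illustration only.
toy_replicate = {"g{}".format(i): float(i) for i in range(1, 11)}
print(sorted(expressed_in_replicate(toy_replicate, 90)))
```

The two proportions act at different levels: `exp_prop_rep` decides whether a gene is active within one sample, and `exp_prop_samp` decides whether enough samples agree.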


In [6]:
# Step 1.3 Analyze proteomics
cmd = ' '.join(['python', 'proteomics_gen.py', 
      '-d', '"{}"'.format(proteomics_data_file), 
      '-s', '"{}"'.format(proteomics_config_file),
      '-e', '"{}"'.format(expression_proportion),
      '-p', '"{}"'.format(top_percentile)])
!{cmd}

Data file is "ProteomicsDataMatrix.xlsx"
Supplementary Data file is "proteomics_data_inputs.xlsx"
                                Naïve 
0     CopyNumber_T4.naive_01_activated
1     CopyNumber_T4.naive_02_activated
2     CopyNumber_T4.naive_03_activated
3     CopyNumber_T4.naive_04_activated
4  CopyNumber_T4.naive_01_steady-state
5  CopyNumber_T4.naive_02_steady-state
6  CopyNumber_T4.naive_03_steady-state
7  CopyNumber_T4.naive_04_steady-state
Test Data Saved to G:/GitHub/MADRID/docker/pipelines/data\Proteomics_Naive.csv


  from pandas.core.index import Index as PandasIndex


In [12]:
# Step 1.4 Merge the gene lists of transcriptomics and proteomics, create a list of active gene IDs
cmd = ' '.join(['python', 'merge_xomics.py', 
      #'-t', '"{}"'.format(transcriptomics_config_file),
      '-b', '"{}"'.format(bulk_config_file),
      '-p', '"{}"'.format(proteomics_config_file),])
!{cmd}

Transcriptomics file is "None"
Proteomics file is "proteomics_data_inputs.xlsx"
Bulk RNA-seq file is "bulk_data_inputs.csv"
                                Naïve 
0     CopyNumber_T4.naive_01_activated
1     CopyNumber_T4.naive_02_activated
2     CopyNumber_T4.naive_03_activated
3     CopyNumber_T4.naive_04_activated
4  CopyNumber_T4.naive_01_steady-state
5  CopyNumber_T4.naive_02_steady-state
6  CopyNumber_T4.naive_03_steady-state
7  CopyNumber_T4.naive_04_steady-state
proteomics exists
Naïve 
Test Data Load From G:/GitHub/MADRID/docker/pipelines/data\Proteomics_Naive.csv
{'Naive':                 CopyNumber_T4.naive_01_activated  ...  top
ENTREZ_GENE_ID                                    ...     
1                                          42776  ...    0
2                                         341085  ...    0
144568                                         0  ...    0
8086                                      109983  ...    0
65985                                      13945  ...   

  from pandas.core.index import Index as PandasIndex



2831                    0    0
1038                    0    0

[25639 rows x 2 columns]}
dict_keys(['Naive'])
dict_keys(['Naive'])
{'Naive'}
set()
keys1 generated
Naive
                prote_exp  prote_top
ENTREZ_GENE_ID                      
1                       1          0
2                       1          0
144568                  1          0
8086                    1          0
65985                   1          0
...                   ...        ...
440590                  1          0
79699                   1          0
7791                    1          1
23140                   1          0
26009                   0          0

[8768 rows x 2 columns]
                bulk_exp  bulk_top
ENTREZ_GENE_ID                    
7105                   0         0
64102                  0         0
8813                   0         0
57147                  0         0
55732                  0         0
...                  ...       ...
105370174              0         0
100533105

## Step 2: Create tissue-specific or cell-type-specific Models

In [16]:
# Load the output of step 1, which is a dictionary that specifies the merged list of active Gene IDs for each tissue

step1_results_file = os.path.join(configs.rootdir, 'data', 'step1_results_files.json')
with open(step1_results_file) as json_file:
    tissue_gene_exp = json.load(json_file)
print(tissue_gene_exp)

{'Naive': 'G:/GitHub/MADRID/docker/pipelines/data\\GeneExpression_Naive_Merged.csv'}


*** Specify input files for step 2 here ***

In [20]:
# (input) filename of General Model, Recon3D_Teff_ver2
GeneralModelFile = 'GeneralModel.mat'

# (input) filename of Tissue Gene Expression
# genefile = 'merged_Th1.csv'

# (output) filename of Tissue Specific Model
# tissuefile = 'Th1_SpecificModel.mat'

In [37]:
# create tissue specific model, the names of output files are stored in dictionary tissue_spec_model
tissue_spec_model = {}

for key,value in tissue_gene_exp.items():
    tissuefile = '{}_SpecificModel.mat'.format(key)
    tissue_spec_model[key] = tissuefile
    tissue_gene_file = re.split('/|\\\\', value)[-1]
    tissue_gene_folder = os.path.join(configs.rootdir, 'data', key)
    os.makedirs(tissue_gene_folder, exist_ok=True)
    cmd = ' '.join(['python', 'create_tissue_specific_model.py', 
                      '-m', '"{}"'.format(GeneralModelFile), 
                      '-g', '"{}"'.format(tissue_gene_file),
                      '-o', '"{}"'.format(tissuefile)])
    !{cmd}

print(tissue_spec_model)

General Model file is "GeneralModel.mat"
Gene Expression file is "GeneExpression_Naive_Merged.csv"
Output file is "Naive_SpecificModel.mat"
(25713, 2)
(0, 2)
(0, 2)
(25713, 2)
Map gene expression to reactions, 0 errors.
1's: 5858
2's: 0
model
Genes: 666
Metabolites: 5174
Reactions: 5858
1.0*biomass_reaction_Mphage - 1.0*biomass_reaction_Mphage_reverse_6cff5
<Solution 0.000 at 0x19dd13d0908>
{'Naive': 'Naive_SpecificModel.mat'}


  'Will not normalize rules with more than ' + str(token_to_gene_ratio) + ' average tokens per gene')
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  expVector = np.array(list(expressionRxns.values()),dtype=np.float)
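
The loop above extracts the file name with `re.split('/|\\\\', value)[-1]` because the stored path mixes `/` and `\` separators. As a sketch of an alternative, `pathlib.PureWindowsPath` accepts both separators on any operating system. The helper name `file_name_only` is hypothetical.

```python
from pathlib import PureWindowsPath


def file_name_only(path_string):
    """Return the last path component, treating both '/' and '\\' as separators."""
    return PureWindowsPath(path_string).name


mixed = 'G:/GitHub/MADRID/docker/pipelines/data\\GeneExpression_Naive_Merged.csv'
print(file_name_only(mixed))  # prints GeneExpression_Naive_Merged.csv
```

As with the regular expression, a POSIX file name that itself contains a backslash would be split at that character.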


## Step 3: Identifying disease-related genes by analyzing transcriptomics data of patients
Differential expression analysis

Only one disease is analyzed; output files are written to the data folder.

*** Specify input files for step 3 here ***

In [None]:
# input file name for the disease transcriptomics data
disease_gene_file = 'disease_transcriptomics_data_inputs.xlsx'

In [None]:
# Differential gene expression analysis
cmd = ' '.join(['python3', 'disease_analysis.py', 
              '-i', '"{}"'.format(disease_gene_file)])
!{cmd}

In [None]:
# load the results of step 3 to dictionary 'disease_files'
step3_results_file = os.path.join(configs.datadir, 'step2_results_files.json')
with open(step3_results_file) as json_file:
    disease_files = json.load(json_file)
print(disease_files)
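
Step 4 reads the `'UP_Reg'` and `'DN_Reg'` entries from `disease_files`, so a quick check after loading can catch an incomplete results file early. This is a minimal sketch, and `check_disease_files` is a hypothetical helper rather than part of the pipeline.

```python
def check_disease_files(disease_files, required_keys=("UP_Reg", "DN_Reg")):
    """Return the required entries, or raise KeyError naming the missing ones."""
    missing = [key for key in required_keys if key not in disease_files]
    if missing:
        raise KeyError("step 3 results are missing: " + ", ".join(missing))
    return {key: disease_files[key] for key in required_keys}


# Placeholder file names for illustration only.
example = {"UP_Reg": "Disease_UP.txt", "DN_Reg": "Disease_DOWN.txt"}
print(check_disease_files(example))
```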

## Step 4: Identification of drug targets and repurposable drugs
This step maps drug targets onto metabolic models, performs knock-out simulations, compares simulation results with disease genes, and identifies drug targets and repurposable drugs.

*** Specify input files for step 4 here ***

1. Instruction: A processed drug-target file is included in `/root/pipelines/data/`. (Optional step) To use an updated version, download `Repurposing_Hub_export.txt` from the [Drug Repurposing Hub](https://clue.io/repurposing-app). Remove all activators, agonists, and withdrawn drugs from the downloaded file, then upload it to `/root/pipelines/data/`.

2. Use the automatically created tissue-specific models. Note: it is recommended to use refined and validated models for further analysis. Users can define customized models in the next sub-step.

In [None]:
# tissue specific models
tissue_spec_model

In [None]:
Disease_Down = disease_files['DN_Reg']
Disease_Up = disease_files['UP_Reg']
drug_raw_file = 'Repurposing_Hub_export.txt'
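
The optional clean-up of a freshly downloaded drug file (removing activators, agonists, and withdrawn drugs) could be sketched as follows. The column names `name`, `clinical_phase`, and `moa` and the sample rows are assumptions made for illustration, so check them against the header of the actual export. The mechanism text is matched on whole words so that "antagonist" is not mistaken for "agonist".

```python
import csv
import io
import re


def is_excluded(row, moa_column="moa", phase_column="clinical_phase"):
    """True for activators, agonists, and withdrawn drugs (column names are assumed)."""
    moa_words = set(re.findall(r"[a-z]+", (row.get(moa_column) or "").lower()))
    withdrawn = (row.get(phase_column) or "").strip().lower() == "withdrawn"
    return withdrawn or bool(moa_words & {"activator", "agonist"})


def filter_drug_table(text):
    """Parse tab-delimited text and keep only the rows that are not excluded."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row for row in reader if not is_excluded(row)]


# Made-up rows for illustration; not taken from the real export.
sample = (
    "name\tclinical_phase\tmoa\n"
    "drug_a\tLaunched\tenzyme inhibitor\n"
    "drug_b\tLaunched\treceptor agonist\n"
    "drug_c\tWithdrawn\tenzyme inhibitor\n"
    "drug_d\tPhase 2\treceptor antagonist\n"
    "drug_e\tLaunched\tchannel activator\n"
)
kept = [row["name"] for row in filter_drug_table(sample)]
print(kept)
```

A plain substring test for "agonist" would also drop every antagonist, which is why the sketch compares whole words.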

3. To use a customized model, specify `tissue_spec_model` manually, e.g. by uncommenting `tissue_spec_model` in the following cell.

In [None]:
# Manually specify up- and down-regulated genes for the disease. (Upload the manually created files to `/pipelines/data/`. Use the file names given below or change them accordingly.)
# Disease_Down = 'Disease_DOWN.txt'
# Disease_Up = 'Disease_UP.txt'
# drug_raw_file = 'Repurposing_Hub_export.txt'

# Manually specify tissue-specific models fine-tuned by the user. Change the file names accordingly. Single or multiple models can be used here; using multiple models increases simulation time.
# tissue_spec_model = {'Th1':'Th1Model.mat',
#                      'Th2':'Th2Model.mat',
#                      'Th17':'Th17Model.mat',
#                      'Naive':'NaiveModel.mat'}

# Manually specify tissue-specific models created with the MATLAB COBRA Toolbox. As an example run, we provide four models of CD4+ T cells (naive, Th1, Th2, and Th17); uncomment all of them or any specific model.
# tissue_spec_model = {'Th1':'Th1_SpecificModel_matlab.mat',
#                      'Th2':'Th2_SpecificModel_matlab.mat',
#                      'Th17':'Th17_SpecificModel_matlab.mat',
#                      'Naive':'Naive_SpecificModel_matlab.mat'}


In [None]:
# Knock out simulation for the analyzed tissues
for key,value in tissue_spec_model.items():
    tissueSpecificModelfile = value
    tissue_gene_folder = os.path.join(configs.datadir, key)
    os.makedirs(tissue_gene_folder, exist_ok=True)
    inhibitors_file = '{}_inhibitors_Entrez.txt'.format(key)
    cmd = ' '.join(['python3' , 'knock_out_simulation.py',
                  '-t', tissueSpecificModelfile,
                  '-i', inhibitors_file,
                  '-u', Disease_Up,
                  '-d', Disease_Down,
                  '-f', key,
                  '-r', drug_raw_file])
    !{cmd}
    
    # copy generated output to output folder
    cmd = ' '.join(['cp', '-a', os.path.join(configs.datadir, key), configs.outputdir])
    !{cmd}
    #break
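
The `cp -a` call above depends on a Unix shell. Where that is not available, a roughly equivalent copy can be sketched with the standard library. `copy_into` is a hypothetical helper, `dirs_exist_ok` requires Python 3.8 or later, and symbolic links are followed rather than preserved, unlike with `cp -a`.

```python
import os
import shutil


def copy_into(src_dir, dest_parent):
    """Copy `src_dir` and its contents into `dest_parent`, similar to `cp -a src dest/`."""
    target = os.path.join(dest_parent, os.path.basename(os.path.normpath(src_dir)))
    # dirs_exist_ok lets a rerun overwrite files from an earlier copy instead of failing.
    shutil.copytree(src_dir, target, dirs_exist_ok=True)
    return target


# Possible use inside the loop above (configs comes from the notebook's imports):
# copy_into(os.path.join(configs.datadir, key), configs.outputdir)
```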
