# COMO: Constraint-based Optimization of Metabolic Objectives

COMO is used to build computational models that simulate the biochemical and physiological processes that occur in a cell or organism, known as constraint-based metabolic models. The basic idea behind a constraint-based metabolic model is to use a set of constraints to place boundaries on the system being modeled. These constraints may include (but are not limited to) limits on the availability of nutrients, energy requirements, and the maximum rates of metabolic reactions. COMO imposes these constraints within a specific context. This context includes the cell or tissue type being modeled, along with its disease state. In addition to creating metabolic models, COMO serves as a platform to identify (1) drug targets and (2) repurposable drugs for metabolism-impacting diseases.


This pipeline has everything necessary to build a model from any combination of the following sources:
- Bulk RNA-seq (total and mRNA)
- Single-cell RNA-seq
- Proteomics


COMO does not require programming experience to create models. However, every step of the pipeline is easily accessible to promote modification, addition, or replacement of analysis steps. In addition, this Docker container comes pre-loaded with popular R and Python libraries; if you would like to use a library and cannot install it for any reason, please [request it on our GitHub page](https://github.com/HelikarLab/COMO)!


<h2>
<font color='red'>⚠️ WARNING ⚠️</font>
</h2>

If you terminate your session after running Docker, any changes you make *will <ins>**not**</ins> be saved*. Please mount a local directory to the docker image, [as instructed on the GitHub page](https://helikarlab.github.io/COMO/#choosing-a-tag), to prevent data loss.

# Before Starting
## Input Files
The proper input files, dependent on the types of data you are using, must be loaded before model creation. Some example files are included to build metabolic models of naive, Th1, Th2, and Th17 T-cell subtypes, and identify targets for rheumatoid arthritis.

### RNA-seq
RNA-seq inputs must be provided as a correctly formatted folder named `COMO inputs` in the data directory. Proper inputs can be generated using our Snakemake pipeline, [FastqToGeneCounts](https://github.com/HelikarLab/FastqToGeneCounts), which is specifically designed for use with COMO. RNA sequencing data can be single-cell or bulk, but the provided Snakemake pipeline does not currently process single-cell data. If you are processing RNA-seq data with an alternate procedure or importing a pre-made gene count matrix, follow the instructions [listed under Step 1](#Importing-a-Pre-Generated-Counts-Matrix).

### Proteomics
Proteomics inputs must be provided as a matrix of measurement values, where rows are protein names in Entrez format and columns are sample names.

## Configuration Information
You should upload configuration files (in Excel format, `.xlsx`) to `data/config_sheets`. The sheet names in these configuration files should correspond to the context (tissue name, cell name, etc.). Each sheet lists the sample names to include in that context-specific model. These sample names should correspond to the column names in the source data matrix, which will be output (or uploaded, if you have your own data) to `data/data_matrices/MODEL-NAME`.

# Drug Target Identification

1. Preprocess Bulk RNA-seq data
    1. Convert STAR-output gene count files into a unified matrix
    2. Fetch necessary information about each gene in the matrix
    3. Generate a configuration file
2. Analyze any combination of RNA-seq or proteomics data, and output a list of active genes for each strategy
3. Check for a consensus amongst strategies according to a desired rigor and merge into a singular set of active genes
4. Create a tissue-specific model based on the list of active genes (from Step 3)
5. Identify differential gene expression from disease datasets using RNA-seq transcriptomics information
6. Identify drug targets and repurposable drugs. This step consists of four substeps:
    1. Map drugs to models
    2. Knock-out simulation
    3. Compare results between perturbed and unperturbed models (i.e., knocked-out models vs non-knocked-out models)
    4. Integrate with disease genes and create a score of drug targets

# Step 1: Data Preprocessing and Analysis

The first step of COMO will perform processing and analysis on each of the following data:
- Total RNA sequencing
- mRNA sequencing
- Proteomics

## RNA-seq Data
RNA sequencing data is read by COMO as a count matrix, where each column is a different sample or replicate named "tissueName_SXRYrZ", where:
- "`X`" represents the study (or batch) number. Each study represents a new experiment
- "`Y`" represents the replicate number
- "`Z`" represents the run number. If the replicate does not contain multiple runs, then "`rZ`" should not be included.
- "`tissueName`" represents the name of the model that will be built from this data. It should be consistent with other data sources if you would like them to be integrated.

❗The `tissueName` identifier should not contain any special characters, including `_`. Doing so may interfere with parsing throughout this pipeline.

Replicates should come from the same study or batch group. Different studies/batches can come from different published studies, as long as the tissue/cell was under similar enough conditions for your personal modeling purposes. "Run numbers" in the same replicate will be summed together.
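The naming convention above can be checked mechanically before any data is handed to COMO. The following is a minimal sketch, not COMO's own parser: the regular expression, the function names `parse_sample_name` and `group_runs`, and the tissue name `naiveB` are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern for "tissueName_SXRYrZ"; the "rZ" suffix is optional.
# The tissue name is restricted to letters and digits (no "_" or other special characters).
SAMPLE_PATTERN = re.compile(
    r"^(?P<tissue>[A-Za-z0-9]+)_S(?P<study>\d+)R(?P<replicate>\d+)(?:r(?P<run>\d+))?$"
)


def parse_sample_name(name):
    """Split a sample name into (tissue, study, replicate, run); run is None when absent."""
    match = SAMPLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Sample name does not follow tissueName_SXRYrZ: {name!r}")
    run = match.group("run")
    return (
        match.group("tissue"),
        int(match.group("study")),
        int(match.group("replicate")),
        int(run) if run is not None else None,
    )


def group_runs(names):
    """Map each replicate-level name (no run suffix) to the run-level names that feed it."""
    groups = defaultdict(list)
    for name in names:
        tissue, study, replicate, _ = parse_sample_name(name)
        groups[f"{tissue}_S{study}R{replicate}"].append(name)
    return dict(groups)


print(parse_sample_name("naiveB_S1R2"))
print(group_runs(["naiveB_S1R1r1", "naiveB_S1R1r2", "naiveB_S1R2"]))
```

Grouping by the replicate-level name shows which runs would be summed together, and a name with an extra underscore is rejected early instead of being misparsed later.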

### Example
Pretend `S1` represents a study done by Margaret and `S2` represents a different study done by John. Margaret's experiment contains three replicates, while John's only contains two. Each of these studies comes from m0 macrophages. Using this cell name, we will set our tissue name to `m0Macro`. The studies were conducted in different labs, by different researchers, at different points in time, even using different preparation kits. Using this information, we have the following samples:

<table style="border: 1px solid black; border-collapse: collapse;">
    <thead>
        <tr>
            <th colspan="1000" style="text-align: center;">m0 Macrophage Data</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td colspan="3" style="padding: 10px; text-align: center; border-bottom: 1px solid black;">Margaret's Data</td>
            <td colspan="3" style="padding: 10px; text-align: center; border-left: 1px solid black; border-bottom: 1px solid black;">John's Data</td>
        </tr>
        <tr>
            <td style="padding: 10px; text-align: center;">Study</td>
            <td style="padding: 10px; text-align: center;">Replicate</td>
            <td style="padding: 10px; text-align: center;">Resulting Name</td>
            <td style="padding: 10px; text-align: center; border-left: 1px solid black;">Study</td>
            <td style="padding: 10px; text-align: center;">Replicate</td>
            <td style="padding: 10px; text-align: center;">Resulting Name</td>
        </tr>
        <tr>
            <td style="padding: 10px; text-align: center;">S1</td>
            <td style="padding: 10px; text-align: center;">R1</td>
            <td style="padding: 10px; text-align: center;">m0Macro_S1R1</td>
            <td style="padding: 10px; text-align: center; border-left: 1px solid black;">S2</td>
            <td style="padding: 10px; text-align: center;">R1</td>
            <td style="padding: 10px; text-align: center;">m0Macro_S2R1</td>
        </tr>
        <tr>
            <td style="padding: 10px; text-align: center;">S1</td>
            <td style="padding: 10px; text-align: center;">R2</td>
            <td style="padding: 10px; text-align: center;">m0Macro_S1R2</td>
            <td style="padding: 10px; text-align: center; border-left: 1px solid black;">S2</td>
            <td style="padding: 10px; text-align: center;">R2</td>
            <td style="padding: 10px; text-align: center;">m0Macro_S2R2</td>
        </tr>
        <tr>
            <td style="padding: 10px; text-align: center;">S1</td>
            <td style="padding: 10px; text-align: center;">R3</td>
            <td style="padding: 10px; text-align: center;">m0Macro_S1R3</td>
            <td style="padding: 10px; text-align: center; border-left: 1px solid black;">-</td>
            <td style="padding: 10px; text-align: center;">-</td>
            <td style="padding: 10px; text-align: center;">-</td>
        </tr>
    </tbody>
</table>

From the `Resulting Name` column, the `m0Macro_S1R1`, `m0Macro_S1R2`, and `m0Macro_S1R3` samples (Margaret's data) will be checked for gene expression consensus to generate a list of active genes in all three replicates. The same will be done for `m0Macro_S2R1` and `m0Macro_S2R2` (John's data). Once these two *separate* lists of active genes have been generated, expression *between* lists will be checked for additional consensus between the studies. This system is used not only to help maintain organization throughout COMO, but because most types of normalized gene counts cannot undergo direct comparisons across replicates. This is especially true for comparisons between different experiments. Therefore, COMO will convert normalized gene counts into a boolean list of active genes. These lists will be compared at the level of replicates in a study, and then again at the level of all provided studies. Finally, the active genes will be merged with the outputs of proteomics and various RNA-sequencing strategies if provided. The rigor used at each level is easily modifiable.
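The two-level consensus described above can be sketched in a few lines. This is an illustration under stated assumptions, not COMO's implementation: the function `consensus`, the gene names, the boolean calls, and the ratios 0.6 and 1.0 are all hypothetical.

```python
def consensus(boolean_lists, ratio):
    """Mark a gene active when at least `ratio` of the gene -> bool mappings call it active."""
    genes = set().union(*boolean_lists)
    total = len(boolean_lists)
    return {
        gene: sum(calls.get(gene, False) for calls in boolean_lists) / total >= ratio
        for gene in genes
    }


# Hypothetical boolean calls for Margaret's replicates (m0Macro_S1R1, m0Macro_S1R2, m0Macro_S1R3)
study_1 = [
    {"geneA": True, "geneB": True, "geneC": False},
    {"geneA": True, "geneB": False, "geneC": False},
    {"geneA": True, "geneB": True, "geneC": True},
]
# Hypothetical boolean calls for John's replicates (m0Macro_S2R1, m0Macro_S2R2)
study_2 = [
    {"geneA": True, "geneB": True, "geneC": False},
    {"geneA": True, "geneB": True, "geneC": True},
]

# Level 1: consensus among the replicates within each study
per_study = [consensus(study_1, ratio=0.6), consensus(study_2, ratio=0.6)]
# Level 2: consensus among the studies
context_calls = consensus(per_study, ratio=1.0)

for gene in sorted(context_calls):
    print(gene, context_calls[gene])
```

Because only boolean calls cross the study boundary, the normalized values of Margaret's and John's experiments are never compared directly.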


### Initializing RNA-seq Data

Please choose an option below:
1. Importing a `COMO inputs` directory
    1. [Initialization using the Snakemake Pipeline](https://github.com/HelikarLab/FastqToGeneCounts)
    2. [Creating your own Inputs](#Creating-a-Properly-Formatted-COMO-inputs-Folder)
2. [Importing a pre-generated gene counts file](#Importing-a-Pre-Generated-Counts-Matrix)

#### Snakemake Pipeline
It is recommended that you use the available Snakemake pipeline to align your reads and create a properly formatted `COMO inputs` folder. The pipeline also runs a series of quality control steps to help determine if any of the provided samples are not suitable for model creation. This pipeline can be found at https://github.com/HelikarLab/FastqToGeneCounts.

The folder output from the Snakemake pipeline can be uploaded directly to the folder `data/COMO inputs` in this pipeline.

Once this is done, continue to the code block at the end of this section.

#### Creating a Properly Formatted `COMO inputs` Folder


If you are using your own alignment protocol, follow this section to create a properly formatted `COMO inputs` folder.

The top-level of the directory will have separate tissue/cell types that models should be created from. The next level must have a folder called `geneCounts`, and optionally a `strandedness` folder. If you are using zFPKM normalization, two additional folders must be included: `layouts` and `fragmentSizes`. Inside each of these folders should be folders named `SX`, where `X` is a number that replicates are associated with.

<br>

<ins>Gene Counts</ins>
Create a folder named `geneCounts`. The outputs of the STAR aligner using the `--quantMode GeneCounts` option should be included inside the "study-number" folders (`SX`) of `geneCounts`. To help you (and COMO!) stay organized, these outputs should be renamed `tissueName_SXRYrZ.tab`. Just like above, `X` is the study number, `Y` is the replicate number, and (if present) `Z` is the run number. If the replicate does not contain multiple runs, the `rZ` should be excluded from the name. Replicates should come from the same study/sample group. Different samples can come from different published studies as long as the experiments were performed under similar enough conditions for your modeling purposes.

<ins>Strandedness</ins>
Create a folder named `strandedness`. This folder should contain files named `tissueName_SXRYrZ_strandedness.txt`. These files must tell the strandedness of the RNA-sequencing method used. Each file should contain one of the following texts (and nothing else):
- `NONE`: If you don't know the strandedness
- `FIRST_READ_TRANSCRIPTION_STRAND`: If this RNA-sequencing sample originates from the first strand of cDNA, or the "antisense" strand
- `SECOND_READ_TRANSCRIPTION_STRAND`: If this RNA-sequencing sample originates from the second strand of cDNA, or the "sense" strand

<ins>Layouts</ins>
Create a folder named `layouts`. Files should be named `tissueName_SXRYrZ_layout.txt`, where each file tells the layout of the library used. It must contain one of the following texts, and nothing else:
- `paired-end`: Paired-end reads were generated
- `single-end`: Single-end reads were generated

<ins>Fragment Sizes</ins>
Create a folder named `fragmentSizes`. Files should be named `tissueName_SXRYrZ_fragment_sizes.txt` and contain the output of [RSeQC](https://rseqc.sourceforge.net/)'s `RNA_fragment_size.py` script.

<ins>Preparation Methods</ins>
Create a folder named `prepMethods`. Files should be named `tissueName_SXRYrZ_prep_method.txt`. Each file should tell the library preparation strategy. It must contain one of the following texts, and nothing else:
- `total`: All RNA expression was measured (mRNA, ncRNA, rRNA, etc.)
- `mRNA`: Only polyA mRNA expression was measured

Note that these labels only serve to differentiate the methods in the event that both are used to build a model. If a different library strategy is desired, you have two options:
1. Replace one of these with a placeholder. If you only have polyA mRNA expression, you only have to enter data for those samples. Do not fill out any samples with `total`.
2. With a little Python knowledge, a new strategy can easily be added to `merge_xomics.py`, which is located under `como/merge_xomics.py` in this Jupyter Notebook.
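For readers building the folder by hand, the layout described in this section can be scripted. The sketch below is only an illustration: the helper `write_sample_metadata` and the tissue name `naiveB` are hypothetical, it writes to a temporary directory rather than `data/COMO inputs`, and it leaves out `fragmentSizes` because those files come from RSeQC. The STAR outputs still have to be copied into `geneCounts` separately.

```python
import tempfile
from pathlib import Path

# Folder name -> file-name suffix, as described in this section
METADATA_FOLDERS = {
    "strandedness": "strandedness",
    "layouts": "layout",
    "prepMethods": "prep_method",
}


def write_sample_metadata(root, tissue, study, replicate, values):
    """Create the per-study folders for one sample and write its one-line metadata files."""
    sample = f"{tissue}_S{study}R{replicate}"
    written = []
    for folder, suffix in METADATA_FOLDERS.items():
        study_dir = Path(root) / tissue / folder / f"S{study}"
        study_dir.mkdir(parents=True, exist_ok=True)
        path = study_dir / f"{sample}_{suffix}.txt"
        path.write_text(values[folder])  # the file holds the keyword and nothing else
        written.append(path)
    # geneCounts only gets its study folder here; the STAR .tab files are copied in by hand
    (Path(root) / tissue / "geneCounts" / f"S{study}").mkdir(parents=True, exist_ok=True)
    return written


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "COMO inputs"
    paths = write_sample_metadata(
        root, "naiveB", study=1, replicate=1,
        values={"strandedness": "NONE", "layouts": "paired-end", "prepMethods": "total"},
    )
    for path in sorted(paths):
        print(path.relative_to(root))
```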

#### Importing a Pre-Generated Counts Matrix
Import a properly formatted counts matrix to `data/data_matrices/exampleTissue/gene_counts_matrix_exampleTissue.csv`. The sample columns should be named `exampleTissue_SXRY` (note the lack of a run number, `rZ`; runs should be summed into each replicate). If you are providing the count matrix this way, instead of generating one using the Snakemake pipeline mentioned above, you must create a configuration file that has each sample's name, study number, and, if using zFPKM, layout and mean fragment length. Use the provided template below to create yours. Once you have created this file and placed it under the `data/data_matrices/exampleTissue` directory, run the `como/rnaseq_preprocess.py` file with `preprocess_mode` set to `provide-matrix`.

This method is best if you are downloading a premade count matrix, or using single-cell data that has already been batch corrected, clustered, and sorted into only the cell type of interest!


<table style="border: 1px solid black; border-collapse: collapse;">
    <thead>
        <tr>
            <th colspan="1000" style="text-align: center; border-bottom: 1px solid black;">Example Gene Count Table</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td style="padding: 10px; text-align: center; border-bottom: 1px solid black;">genes</td>
            <td style="padding: 10px; text-align: center; border-left: 1px solid black; border-bottom: 1px solid black;">exampleTissue_S1R1</td>
            <td style="padding: 10px; text-align: center; border-left: 1px solid black; border-bottom: 1px solid black;">exampleTissue_S1R2</td>
            <td style="padding: 10px; text-align: center; border-left: 1px solid black; border-bottom: 1px solid black;">exampleTissue_S2R1</td>
            <td style="padding: 10px; text-align: center; border-left: 1px solid black; border-bottom: 1px solid black;">exampleTissue_S2R2</td>
        </tr>
        <tr>
            <td style="padding: 10px; text-align: center;">ENSG00000000003</td>
            <td style="padding: 10px; text-align: center;">20</td>
            <td style="padding: 10px; text-align: center;">29</td>
            <td style="padding: 10px; text-align: center;">52</td>
            <td style="padding: 10px; text-align: center;">71</td>
        </tr>
        <tr>
            <td style="padding: 10px; text-align: center;">ENSG00000000005</td>
            <td style="padding: 10px; text-align: center;">0</td>
            <td style="padding: 10px; text-align: center;">0</td>
            <td style="padding: 10px; text-align: center;">0</td>
            <td style="padding: 10px; text-align: center;">0</td>
        </tr>
        <tr>
            <td style="padding: 10px; text-align: center;">ENSG00000000419</td>
            <td style="padding: 10px; text-align: center;">1354</td>
            <td style="padding: 10px; text-align: center;">2081</td>
            <td style="padding: 10px; text-align: center;">1760</td>
            <td style="padding: 10px; text-align: center;">3400</td>
        </tr>
    </tbody>
</table>
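If your own counts are still split by run, they need to be summed into replicate-level columns before the matrix is written. The sketch below shows one way to do that with plain dictionaries; the function `sum_runs` and the run-level numbers are made up for illustration (chosen so that the totals match the `exampleTissue_S1R1` column above).

```python
import re
from collections import defaultdict


def sum_runs(run_counts):
    """Collapse run-level columns (e.g. exampleTissue_S1R1r1) into replicate-level columns."""
    replicate_counts = defaultdict(lambda: defaultdict(int))
    for column, counts in run_counts.items():
        # Drop a trailing "rZ" run suffix; names without one are left unchanged
        replicate = re.sub(r"r\d+$", "", column)
        for gene, value in counts.items():
            replicate_counts[replicate][gene] += value
    return {name: dict(counts) for name, counts in replicate_counts.items()}


# Hypothetical run-level counts for one replicate sequenced in two runs
run_counts = {
    "exampleTissue_S1R1r1": {"ENSG00000000003": 12, "ENSG00000000419": 800},
    "exampleTissue_S1R1r2": {"ENSG00000000003": 8, "ENSG00000000419": 554},
    "exampleTissue_S1R2": {"ENSG00000000003": 29, "ENSG00000000419": 2081},
}

matrix = sum_runs(run_counts)
print(sorted(matrix))
print(matrix["exampleTissue_S1R1"])
```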

### RNA-seq Preprocessing Parameters
- `context_names`: The tissue/cell types to use. This is a simple space-separated list of items, such as "naiveB regulatoryTcell"
- `gene_format`: The format of input genes, accepts `"Entrez"`, `"Ensembl"` or `"Symbol"`
- `taxon_id`: The [NCBI Taxon ID](https://www.ncbi.nlm.nih.gov/taxonomy) to use
- `preprocess_mode`: This should be set to `"create-matrix"` if you are **not** providing a matrix, otherwise set it to `"provide-matrix"`

In [None]:
context_names = "naiveB"
gene_format = "Ensembl"  # accepts "Entrez", "Ensembl", and "Symbol"
taxon_id = "human"  # accepts integer (bioDBnet taxon id) or "human" or "mouse"
preprocess_mode = "create-matrix"  # "create-matrix" or "provide-matrix"

# fmt: off
cmd = " ".join(
    [
        "python3", "como/rnaseq_preprocess.py",
        "--context-names", f"{context_names}",
        "--gene-format", f"{gene_format}",
        "--taxon-id", f"{taxon_id}",
        f"--{preprocess_mode}",
    ]
)
# fmt: on

!{cmd}


## Identification of Gene Activity in Transcriptomic and Proteomic Datasets

This part of Step 1 will identify gene activity in the following data sources:
- RNA-seq (total, mRNA, or single cell)
- Proteomics

Only one source is required for model generation, but multiple sources can be helpful for additional validation if they are of high enough quality.

### Filtering Raw Counts
Regardless of the normalization technique or the provided RNA-seq files used, preprocessing is required to fetch relevant gene information needed for harmonization and normalization, such as the Entrez ID and the start and end positions. Currently, COMO can filter raw RNA-sequencing counts using one of the following normalization techniques:

#### Transcripts Per Million Quantile
TPM Quantile. Each replicate is normalized with Transcripts-Per-Million, and an upper quantile is taken to create a boolean list of active genes for the replicate (i.e., `R1`). Replicates are compared for consensus within the study, and then studies are compared between one another for additional consensus. The strictness of the consensus can easily be set using the appropriate option within the `rnaseq_gen.py` code-block.

This method is recommended if you want more control over the size of the model; smaller models can include only the most expressed reactions, and larger models can encompass less essential reactions.
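To make the idea concrete, here is a minimal sketch of TPM normalization followed by a quantile cutoff for a single replicate. It is not COMO's code: the function names, gene names, counts, and gene lengths are invented, and the percentile is computed with Python's `statistics.quantiles` using the "inclusive" method, which may differ from the estimator COMO uses.

```python
import statistics


def tpm(counts, lengths):
    """Transcripts-Per-Million: length-normalize each count, then scale the rates to sum to 1e6."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def active_by_quantile(tpm_values, quantile):
    """Call a gene active when its TPM is at or above the given percentile (an integer, 1 to 99)."""
    cutoff = statistics.quantiles(tpm_values.values(), n=100, method="inclusive")[quantile - 1]
    return {gene: value >= cutoff for gene, value in tpm_values.items()}


# Hypothetical raw counts and gene lengths (in bases) for one replicate
counts = {"geneA": 100, "geneB": 300, "geneC": 0, "geneD": 600}
lengths = {"geneA": 1000, "geneB": 1000, "geneC": 1000, "geneD": 1000}

replicate_tpm = tpm(counts, lengths)
print(active_by_quantile(replicate_tpm, quantile=50))
```

Raising `quantile` keeps fewer genes and so yields a smaller model; lowering it does the opposite.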

#### zFPKM
This method is outlined by [Hart et al.](https://pubmed.ncbi.nlm.nih.gov/24215113/). Counts will be normalized using zFPKM, and genes with a zFPKM greater than -3 will be considered "expressed", per Hart's recommendation. Expressed genes will be checked for consensus at the replicate and study level.

This method is recommended if you want less control over which genes are essential, and instead want to use the most standardized method of active gene determination. This method is more "hands-off" than the above TPM Quantile method.

#### Counts Per Million
This is a flat cutoff value of counts per million normalized values. Gene expression will be checked for consensus at the replicate and study level.

This method is not recommended, as zFPKM is much more robust for a similar level of "hands-off" model building.
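For comparison, a flat counts-per-million cutoff looks like the sketch below. The function names and the cutoff of 10 CPM are placeholders chosen for illustration, not COMO defaults.

```python
def cpm(counts):
    """Counts-Per-Million: scale each raw count by the library size of the replicate."""
    library_size = sum(counts.values())
    return {gene: count / library_size * 1e6 for gene, count in counts.items()}


def active_by_cpm(counts, cutoff):
    """Call a gene active when its CPM value is at or above a flat cutoff."""
    return {gene: value >= cutoff for gene, value in cpm(counts).items()}


# Hypothetical raw counts for one replicate
counts = {"geneA": 5, "geneB": 0, "geneC": 400_000, "geneD": 599_995}
print(active_by_cpm(counts, cutoff=10))
```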


### RNA Sequencing Analysis
#### Bulk RNA Sequencing

This has multiple strategies of library preparation (total, polyA-mRNA). If you are using public data, you may encounter a situation where you would like to use a combination of bulk RNA sequencing data produced using two different library preparation strategies.

COMO currently supports the two most common strategies: mRNA polyA-enriched RNA sequencing and total RNA sequencing. Because of the expected differences in distribution of transcripts, COMO is written to handle each strategy separately before the integration step. The recommended Snakemake alignment pipeline is designed to work with COMO's preprocessing step ([Step 1, above](Step-1:-Initialize-and-Preprocess-RNA-seq-data)) to split RNA sequencing data from GEO into separate input matrices and configuration files.

To create a gene expression file for total RNA sequencing data, use `"total"` for the "`--library-prep`" argument.
To create a gene expression file for mRNA polyA enriched data, use `"mRNA"` for the  "`--library-prep`" argument.

The analysis of each strategy is identical. Specifying the type of analysis (total vs mRNA) only serves to ensure COMO analyzes them separately.

#### Single Cell RNA Sequencing
While the Snakemake pipeline does not yet support single-cell alignment, and COMO does not yet support automated configuration file and counts matrix file creation for single-cell alignment output from STAR, it is possible to use single-cell data to create a model with COMO. Because normalization strategies can be applied to single-cell data in the same way it is applied to bulk RNA sequencing, `como/rnaseq_gen.py` can be used with a provided counts matrix and configuration file, from [Step 1](Step-1:-Initialize-and-Preprocess-RNA-seq-data), above. Just like `"total"` and `"mRNA"`, `como/rnaseq_gen.py` can be executed with `"SC"` as the "`--library-prep`" argument to help COMO differentiate it from any bulk RNA sequencing data if multiple strategies are being used.


### Total RNA Sequencing Generation
#### Parameters
- `trnaseq_config_file`: The configuration filename for total RNA. This file is found under the `data/config_sheets` folder
- `rep_ratio`: The proportion of replicates that must express a gene before that gene is considered "active" in a study
- `group_ratio`: The proportion of studies that must express a gene before that gene is considered "active"
- `rep_ratio_h`: The proportion of replicates that must express a gene before that gene is considered "high-confidence"
- `group_ratio_h`: The proportion of studies that must express a gene before that gene is considered "high-confidence"
- `technique`: The technique to use. Options are: `"quantile"`, `"cpm"`, or `"zfpkm"`. The difference in these options is discussed above
- `quantile`: The cutoff Transcripts-Per-Million quantile for filtering
- `min_zfpkm`: The cutoff for zFPKM filtering
- `prep_method`: The library method used for preparation. Options are: `"total"`, `"mRNA"`, or `"SC"`


In [None]:
# step 2.2 RNA-seq Analysis for Total RNA-seq library preparation

trnaseq_config_file = "trnaseq_data_inputs_auto.xlsx"
rep_ratio = 0.75
group_ratio = 0.75
rep_ratio_h = 1.0
group_ratio_h = 1.0
technique = "zfpkm"
quantile = 50
min_zfpkm = -3
prep_method = "total"

# fmt: off
cmd = " ".join(
    [
        "python3", "como/rnaseq_gen.py",
        "--config-file", f"{trnaseq_config_file}",
        "--replicate-ratio", f"{rep_ratio}",
        "--batch-ratio", f"{group_ratio}",
        "--high-replicate-ratio", f"{rep_ratio_h}",
        "--high-batch-ratio", f"{group_ratio_h}",
        "--filt-technique", f"{technique}",
        "--min-zfpkm", f"{min_zfpkm}",
        "--library-prep", f"{prep_method}",
    ]
)
# fmt: on

!{cmd}

## mRNA Sequencing Generation
These parameters are identical to the ones listed for [total RNA sequencing](#Total-RNA-Sequencing-Generation), but they are listed again here for ease of reference

### Parameters
- `mrnaseq_config_file`: The configuration filename for mRNA. This file is found under the `data/config_sheets` folder
- `rep_ratio`: The proportion of replicates that must express a gene before that gene is considered "active" in a study
- `group_ratio`: The proportion of studies that must express a gene before that gene is considered "active"
- `rep_ratio_h`: The proportion of replicates that must express a gene before that gene is considered "high-confidence"
- `group_ratio_h`: The proportion of studies that must express a gene before that gene is considered "high-confidence"
- `technique`: The technique to use. Options are: `"quantile"`, `"cpm"`, or `"zfpkm"`. The difference in these options is discussed above
- `quantile`: The cutoff Transcripts-Per-Million quantile for filtering
- `min_zfpkm`: The cutoff for zFPKM filtering
- `prep_method`: The library method used for preparation. Options are: `"total"`, `"mRNA"`, or `"SC"`


In [None]:
mrnaseq_config_file = "mrnaseq_data_inputs_auto.xlsx"
rep_ratio = 0.75
group_ratio = 0.75
rep_ratio_h = 1.0
group_ratio_h = 1.0
technique = "zfpkm"
quantile = 50
min_zfpkm = -3
prep_method = "mRNA"

# fmt: off
cmd = " ".join(
    [
        "python3", "como/rnaseq_gen.py",
        "--config-file", f"{mrnaseq_config_file}",
        "--replicate-ratio", f"{rep_ratio}",
        "--batch-ratio", f"{group_ratio}",
        "--high-replicate-ratio", f"{rep_ratio_h}",
        "--high-batch-ratio", f"{group_ratio_h}",
        "--filt-technique", f"{technique}",
        "--min-zfpkm", f"{min_zfpkm}",
        "--quantile", f"{quantile}",
        "--library-prep", f"{prep_method}",
    ]
)
# fmt: on

!{cmd}

## Proteomics Analysis
The parameters here are mostly the same as those for total RNA and mRNA sequencing analysis, and are listed here for easier reference.

### Parameters
- `proteomics_config_file`: The file path to the proteomics configuration file
- `rep_ratio`: The ratio required before a gene is considered active in the replicate
- `batch_ratio`: The ratio required before a gene is considered active in the study
- `high_rep_ratio`: The ratio required before a gene is considered "high-confidence" in the replicate
- `high_batch_ratio`: The ratio required before a gene is considered "high-confidence" in the study
- `quantile`: The cutoff Transcripts-Per-Million quantile for filtering

In [None]:
proteomics_config_file = "proteomics_data_inputs_paper.xlsx"
rep_ratio = 0.75
batch_ratio = 0.75
high_rep_ratio = 1.0
high_batch_ratio = 1.0
quantile = 25

# fmt: off
cmd = " ".join(
    [
        "python3", "como/proteomics_gen.py",
        "--config-file", f"{proteomics_config_file}",
        "--replicate-ratio", f"{rep_ratio}",
        "--high-replicate-ratio", f"{high_rep_ratio}",
        "--batch-ratio", f"{batch_ratio}",
        "--high-batch-ratio", f"{high_batch_ratio}",
        "--quantile", f"{quantile}",
    ]
)
# fmt: on

!{cmd}

# Cluster Sample Data (Optional)
This step is used to cluster the samples based on their expression values. This can be used to determine which samples are more similar to each other. In a perfect world, one cluster would be created for each context type used. This is done using the `como/cluster_rnaseq.py` script.

To see more about clustering, please visit the [Wikipedia article](https://en.wikipedia.org/wiki/Cluster_analysis)

The parameters for this script are as follows:
- `context_names`: The tissue/cell name of models that should be clustered. This was defined in the first code block, so it is not redefined here
- `filt_technique`: The filtering technique to use; options are: `"zfpkm"`, `"quantile"`, or `"cpm"`
- `cluster_algorithm`: The clustering algorithm to use. Options are: `"mca"` or `"umap"`
- `label`: Should the samples be labeled in the plot? Options are: `"True"` or `"False"`
- `min_dist`: The minimum distance for UMAP clustering. Must be between 0 and 1. Default value is 0.01
- `replicate_ratio`: The ratio of active genes in replicates for a batch/study to be considered active. The default is 0.9
- `batch_ratio`: The ratio of active genes in batches/studies for a context to be considered active. The default is 0.9
- `min_count`: The ratio of active genes in a batch/study for a context to be considered active. The default is `"default"`
- `quantile`: The ratio of active genes in a batch/study for a context to be considered active. The default is 0.5
- `n_neighbors_rep`: N nearest neighbors for replicate clustering. The default is `"default"`, which is the total number of replicates
- `n_neighbors_batch`: N nearest neighbors for batch clustering. The default is `"default"`, which is the total number of batches
- `n_neighbors_context`: N nearest neighbors for context clustering. The default is `"default"`, which is the total number of contexts
- `seed`: The random seed for clustering algorithm initialization. If not specified, `np.random.randint(0, 100000)` is used

In [None]:
filt_technique = "zfpkm"
cluster_algorithm = "umap"
label = True
min_dist = 0.01
replicate_ratio = 0.9
batch_ratio = 0.9
min_count = "default"
quantile = 50
n_neighbors_rep = "default"
n_neighbors_batch = "default"
n_neighbors_context = "default"
seed = -1

# fmt: off
cmd = " ".join(
    [
        "python3", "como/cluster_rnaseq.py",
        "--context-names", f"{context_names}",
        "--filt-technique", f"{filt_technique}",
        "--cluster-algorithm", f"{cluster_algorithm}",
        "--label", f"{label}",
        "--min-dist", f"{min_dist}",
        "--replicate-ratio", f"{replicate_ratio}",
        "--batch-ratio", f"{batch_ratio}",
        "--n-neighbors-rep", f"{n_neighbors_rep}",
        "--n-neighbors-batch", f"{n_neighbors_batch}",
        "--n-neighbors-context", f"{n_neighbors_context}",
        "--min-count", f"{min_count}",
        "--quantile", f"{quantile}",
        "--seed", f"{seed}",
    ]
)
# fmt: on

!{cmd}

## Merge Expression from Different Data Sources

Thus far, active genes have been determined for at least one data source. If multiple data sources are being used, we can merge the active genes from these sources to make a list of active genes that is more comprehensive (or strict!) than any data source on its own.

`como/merge_xomics.py` takes each data source discussed so far as an argument. The other arguments to consider are:
- `--expression-requirement`: The number of data sources with expression required for a gene to be considered active, if the gene is not "high-confidence" for any source. (default: total number of input sources provided)
- `--requirement-adjust`: This is used to adjust the expression requirement argument in the event that tissues have a different number of provided data sources. This does nothing if there is only one tissue type in the configuration files.
    - `"progressive"`: The expression requirement applies to the tissue(s) with the lowest number of data sources. Tissues with more data sources than this will require their genes to be expressed in 1 additional source before the gene is considered "active" in the model
    - `"regressive"` (default): The expression requirement applies to the tissue(s) with the largest number of data sources. Tissues with fewer data sources than this will require their genes to be expressed in 1 fewer source before the gene is considered "active" in the model.
    - `"flat"`: The expression requirement is used regardless of differences in the number of data sources provided for different tissues

- `--no-hc`: This flag should be set to prevent high-confidence genes from overriding the expression requirement set.
    - If this flag is not used, any gene that was determined to be "high-confidence" in any input source will cause the gene to be active in the final model, regardless of agreement with other sources
- `--no-na-adjustment`: This flag should be used to prevent genes that are not present in one data source, but are present in others, from subtracting one from the expression requirement.
    - If this flag is not used, any time a gene is "NA" in a source, meaning it was not tested for in the library of that data source but <ins>was</ins> tested in the library of another source, one is subtracted from the expression requirement.

The adjusted expression requirement will never resolve to less than one or to more than the number of data sources for a given tissue.
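
The adjustment rules above can be sketched in a few lines of Python. This is an illustration only, not COMO's implementation: the function name `adjusted_requirement` and its arguments are hypothetical, and it assumes that `"progressive"` and `"regressive"` shift the requirement by one for each data source a tissue has above the minimum or below the maximum.

```python
def adjusted_requirement(requirement, n_sources, min_sources, max_sources,
                         method="regressive", n_na=0):
    """Illustrative only: adjust the expression requirement for one gene in one tissue.

    requirement: the value passed as --expression-requirement
    n_sources:   number of data sources available for this tissue
    min_sources / max_sources: fewest / most data sources across all tissues
    n_na:        number of sources where the gene is "NA" (0 if --no-na-adjustment is set)
    """
    if method == "progressive":
        # Requirement applies to the tissue(s) with the fewest sources; others need more
        adjusted = requirement + (n_sources - min_sources)
    elif method == "regressive":
        # Requirement applies to the tissue(s) with the most sources; others need fewer
        adjusted = requirement - (max_sources - n_sources)
    elif method == "flat":
        adjusted = requirement
    else:
        raise ValueError(f"Unknown adjustment method: {method}")

    adjusted -= n_na  # each "NA" source lowers the requirement by one

    # Never below one, never above the number of sources for this tissue
    return max(1, min(adjusted, n_sources))


# A tissue with 2 sources, when another tissue has 3, under the default "regressive" rule
print(adjusted_requirement(3, n_sources=2, min_sources=2, max_sources=3))  # 2
```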

### Parameters
The three parameters listed here were defined in earlier sections of the notebook and should not need to be defined again. If you did **not** use one of these data sources, comment it out of the command below by placing a "`#`" at the beginning of the appropriate lines.
- `trnaseq_config_file`: The file name used in the [total RNA Sequencing](#Total-RNA-Sequencing-Generation) section of the notebook
- `mrnaseq_config_file`: The file name used in the [mRNA Sequencing](#mRNA-Sequencing-Generation) section of the notebook
- `proteomics_config_file`: The file name used in the [proteomics generation](#Proteomics-Analysis) section of the notebook

The following parameters have not been used in a previous section of the notebook, so they are defined in the code block below.
- `expression_requirement`: This is the number of sources a gene must be active in for it to be considered active
- `requirement_adjust`: The technique to adjust expression requirement based on differences in number of provided data source types
- `total_rna_weight`: Total RNA-seq weight for merging zFPKM distribution
- `mrna_weight`: mRNA weight for merging zFPKM distribution
- `single_cell_weight`: Single-cell weight for merging zFPKM distribution
- `proteomics_weight`: Proteomic weight for merging zFPKM distribution

Each of the "weights" (`total_rna_weight`, `mrna_weight`, etc.) sets the significance placed on a data source. Because there are many steps in the central dogma between transcription and translation, gene expression as measured by total RNA or mRNA sequencing may not be representative of the gene's protein expression, and thus of its metabolic impact. For this reason, you can weight each source more (or less) than another.
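
To make the effect of the weights concrete, the sketch below combines per-source z-scores for a single gene using a weighted mean. This is an assumption made for illustration: the formula COMO actually uses to merge zFPKM distributions may differ, the function name `weighted_score` is hypothetical, and the z-scores are made-up values.

```python
def weighted_score(z_scores, weights):
    """Weighted mean of z-scores, skipping sources where the gene was not measured (None)."""
    pairs = [(z, weights[source]) for source, z in z_scores.items() if z is not None]
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        raise ValueError("No measured sources with a non-zero weight")
    return sum(z * weight for z, weight in pairs) / total_weight


# Weights mirror the values set in the notebook; z-scores are invented for the example
weights = {"total_rna": 6, "mrna": 6, "proteomics": 10}
z_scores = {"total_rna": -1.0, "mrna": -1.0, "proteomics": 1.0}

print(weighted_score(z_scores, weights))
```

With equal weights the two RNA sources would dominate the result; giving proteomics a weight of 10 pulls the combined score toward the protein measurement.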

In [None]:
expression_requirement = 3
requirement_adjust = "regressive"
total_rna_weight = 6
mrna_weight = 6
single_cell_weight = 6
proteomics_weight = 10

# fmt: off
cmd = " ".join(
    [
        "python3", "como/merge_xomics.py",
        "--merge-distribution",
        "--total-rnaseq-config-file", f"{trnaseq_config_file}",
        "--mrnaseq-config-file", f"{mrnaseq_config_file}",
        # "--scrnaseq-config-file", f"{scrnaseq_config_file}",      # If using single-cell data, uncomment the start of this line
        # "--proteomics-config-file", f"{proteomics_config_file}",
        "--expression-requirement", f"{expression_requirement}",
        "--requirement-adjust", f"{requirement_adjust}",
        "--total-rnaseq-weight", f"{total_rna_weight}",
        "--mrnaseq-weight", f"{mrna_weight}",
        # "--single-cell-rnaseq-weight", f"{single_cell_weight}",             # If using single-cell data, uncomment the start of this line
        "--protein-weight", f"{proteomics_weight}",
        "--no-hc",
    ]
)
# fmt: on

!{cmd}

# Step 2: Create Tissue/Cell-Type Specific Models

## Boundary Reactions
To create a metabolic model, the following information about each metabolite or reaction involved is required:
- **Reaction Type**
    - Exchange
    - Demand
    - Sink
- **Metabolite/Reaction Abbreviation**
    - You can use the [Virtual Metabolic Human](https://www.vmh.life/#home) to look up your metabolite and reaction abbreviations
- **Compartments**
    - Cytosol
    - Extracellular
    - Golgi Apparatus
    - Internal Membranes
    - Lysosome
    - Mitochondria
    - Nucleus
    - Endoplasmic Reticulum
    - Unknown
- **Minimum Reaction Rate**
- **Maximum Reaction Rate**


*Below is an example of a properly formatted table of metabolite and reaction information*

| Reaction | Abbreviation |    Compartment     | Minimum Reaction Rate | Maximum Reaction Rate |
|:--------:|:------------:|:------------------:|:---------------------:|:---------------------:|
| Exchange |    glc_D     |   Extracellular    |         -100          |         1000          |
|  Demand  |  15HPETATP   |      Cytosol       |          -1           |         1000          |
|   Sink   |    met_L     | Internal Membranes |         -1000         |           1           |


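These reactions should be placed into a CSV file; a template can be found at `data/boundary_rxns/default_force_rxns.csv`. Append your reactions to this file, and remove any that are not required. COMO loads this file during model creation. Note that the model-creation code block below looks for a per-context file named `data/boundary_rxns/<context>_boundary_rxns.csv`.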
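
As a rough sketch, the example table can be written out as CSV with Python's standard `csv` module. The column headers are assumed to match the table above exactly; check them against the template file before relying on this. The `validate` helper is hypothetical and only checks the reaction type and that the minimum rate does not exceed the maximum.

```python
import csv
import io

COLUMNS = ["Reaction", "Abbreviation", "Compartment", "Minimum Reaction Rate", "Maximum Reaction Rate"]
REACTION_TYPES = {"Exchange", "Demand", "Sink"}

# Rows copied from the example table above
rows = [
    ["Exchange", "glc_D", "Extracellular", -100, 1000],
    ["Demand", "15HPETATP", "Cytosol", -1, 1000],
    ["Sink", "met_L", "Internal Membranes", -1000, 1],
]


def validate(row):
    """Hypothetical sanity check for one boundary reaction."""
    reaction_type, _, _, minimum, maximum = row
    if reaction_type not in REACTION_TYPES:
        raise ValueError(f"Unknown reaction type: {reaction_type}")
    if minimum > maximum:
        raise ValueError(f"Minimum rate {minimum} exceeds maximum rate {maximum}")


# Write to an in-memory buffer; swap in open(<your file>, "w", newline="") to write to disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
for row in rows:
    validate(row)
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```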

## Force Reactions
Force reactions are reactions that should **always** be included in the model, no matter their flux value in the metabolic data provided. In contrast to the boundary reaction list, this is simply a list of reaction names that should be "forced" through the model. Append your force reactions to the `data/force_rxns/default_force_rxns.csv` file, and remove any that are not required. COMO loads this file during model creation. Note that the model-creation code block below looks for a per-context file named `data/force_rxns/<context>_force_rxns.csv`.

*Below is an example of a properly formatted table of force reactions*

| Reaction |
|:--------:|
|  glc_D   |
|  met_L   |

## Adding Reference Models
This notebook uses Recon3D, available from the [Virtual Metabolic Human](https://www.vmh.life/), as the base model onto which reactions are mapped; it is included with the notebook. If you would like to use a different reference model, upload it to the `data` folder and set `general_model_file` below to the name of your reference model.

## Parameters
The following is a list of parameters and their function in this section of the pipeline
- `low_threshold`: If you are using the `IMAT` reconstruction algorithm, gene expression above this value will be placed in the "mid-expression" bin
- `high_threshold`: If you are using the `IMAT` reconstruction algorithm, gene expression above this value will be placed in the "high-expression" bin
- `output_filetypes`: The file types to save your model as. It should be one or more of the following, separated by spaces: `"xml"`, `"mat"`, `"json"`
- `objective_dict`: A dictionary that maps each context name to the objective its model should be solved for. Popular options are `"biomass_reaction"` or `"biomass_maintenance"`
- `general_model_file`: The reference model file to load
- `recon_algorithms`: A list of the troppo reconstruction algorithms to use. Each entry should be one of the following: `"FastCORE"`, `"CORDA"`, `"GIMME"`, `"tINIT"`, `"IMAT"`
- `solver`: The solver to use for optimizing the model. Options are: `"GUROBI"` or `"GLPK"`
- `boundary_reactions_filepath`: The path to the boundary reactions file that should be used
- `force_reactions_filepath`: The path to the force reactions file to be used. Force reactions will (as the name implies) force the optimizer to use these reactions, **no matter their expression**
- `exclude_reactions_filepath`: The path to a file of reactions to exclude from the model, no matter their expression
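
The two thresholds split expression values into three bins. The sketch below shows that split using the default values from the code cell in this section; the function name `expression_bin` is hypothetical, and treating a value exactly equal to a threshold as belonging to the lower bin is an assumption, since the text only says "above".

```python
def expression_bin(value, low_threshold=-5, high_threshold=-3):
    """Illustrative binning of one expression value (not COMO's implementation)."""
    if value > high_threshold:
        return "high"
    if value > low_threshold:
        return "mid"
    return "low"


for value in (-6.0, -4.0, -2.0):
    print(value, expression_bin(value))
```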

In [None]:
# Set your objectives before running!
objective_dict = {"naiveB": "biomass_maintenance", "smB": "biomass_maintenance"}
# -----------------

low_threshold = -5
high_threshold = -3
output_filetypes = "xml mat json"
general_model_file = "GeneralModelUpdatedV2.mat"
recon_algorithms = ["IMAT"]
solver = "GUROBI"

import json
import os
from pathlib import Path

from como.project import Config

config = Config()

# Load the output of step 1, which is a dictionary that specifies the merged list of active Gene IDs for each tissue
step1_results_file = os.path.join(config.data_dir, "results", "step1_results_files.json")
with open(step1_results_file) as json_file:
    context_gene_exp = json.load(json_file)

for recon_algorithm in recon_algorithms:
    for context in context_gene_exp.keys():
        objective = objective_dict[context]

        if recon_algorithm.upper() in ["IMAT", "TINIT"]:
            active_genes_filepath = os.path.join(config.data_dir, "results", context, f"model_scores_{context}.csv")
        else:
            gene_expression_file = context_gene_exp[context]
            active_genes_filename = Path(gene_expression_file).name
            active_genes_filepath = os.path.join(config.data_dir, "results", context, active_genes_filename)

        # Use the reference model named in `general_model_file` above
        general_model_filepath = os.path.join(config.data_dir, general_model_file)
        boundary_reactions_filepath = os.path.join(config.data_dir, "boundary_rxns", f"{context}_boundary_rxns.csv")
        force_reactions_filepath = os.path.join(config.data_dir, "force_rxns", f"{context}_force_rxns.csv")
        exclude_reactions_filepath = os.path.join(config.data_dir, "exclude_rxns", f"{context}_exclude_rxns.csv")

        # fmt: off
        cmd = " ".join(
            [
                "python3", "como/create_context_specific_model.py",
                "--context", f"{context}",
                "--reference-model-filepath", f"{general_model_filepath}",
                "--active-genes-filepath", f"{active_genes_filepath}",
                "--objective", f"{objective}",
                "--boundary-reactions-filepath", f"{boundary_reactions_filepath}",
                # "--exclude-reactions-filepath", f"{exclude_reactions_filepath}",
                "--force-reactions-filepath", f"{force_reactions_filepath}",
                "--algorithm", f"{recon_algorithm}",
                "--low-threshold", f"{low_threshold}",
                "--high-threshold", f"{high_threshold}",
                "--solver", f"{solver}",
                "--output-filetypes", f"{output_filetypes}",
            ]
        )
        # fmt: on
        !{cmd}

# Generate MEMOTE Reports
> NOTE: This step is entirely optional

MEMOTE is an open-source tool that automates the testing and reporting of metabolic models. Its report is a detailed summary of the tests performed on a given metabolic model (i.e., the one you just generated), along with the results and recommendations for improving the model. The code block below also draws flux maps with Escher, and these require a metabolic "map". Several maps are included in COMO, found under `data/maps/RECON1`. If you would like to add your own maps, they can be placed in either of two locations:
1. If you mounted a `local_files` directory to the container before starting, you can simply copy-and-paste them into the `local_files/maps` directory using the file browser of your computer. This is the most robust solution because the files will not be deleted by the container after it stops, or if it is updated in the future.
2. You can upload them to the Jupyter notebook under the `data/maps` directory. The code block below will search for any `.json` files that are not already included in the `map_dict` dictionary.


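The resulting map figures will be saved to `data/results/exampleTissue/figures/mapName_map_exampleTissue_ALGORITHM.html`, and the MEMOTE report will be saved to `data/results/exampleTissue/memote_report_exampleTissue_ALGORITHM.html`.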

- `mapName`: This is the name of the map file. In the `map_dict` dictionary below, this value would be `trypto`, `retinol`, etc.
- `exampleTissue`: This is the name of the tissue context
- `ALGORITHM`: This is the algorithm (`recon_algorithm`) used in the above model creation step
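
The naming scheme for the map figures can be reproduced with `pathlib`, which is handy for checking where a given file should appear. The helper name `map_output_path` is hypothetical, and `"data"` stands in for the notebook's `config.data_dir`.

```python
from pathlib import Path


def map_output_path(data_dir, context, map_name, algorithm):
    """Build the expected location of one map figure."""
    return Path(data_dir) / "results" / context / "figures" / f"{map_name}_map_{context}_{algorithm}.html"


print(map_output_path("data", "naiveB", "trypto", "IMAT").as_posix())
# data/results/naiveB/figures/trypto_map_naiveB_IMAT.html
```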


In [None]:
import os
from pathlib import Path

import cobra
from como.project import Config
from escher import Builder

config = Config()

user_map_dir = Path(f"{config.data_dir}/local_files/maps/")
map_dict = {
    "trypto": f"{config.data_dir}/maps/RECON1/RECON1.tryptophan_metabolism.json",
    # "lipid": f"{config.data_dir}/maps/RECON1/RECON1.",  # Not present in COMO by default yet
    "retinol": f"{config.data_dir}/maps/RECON1/RECON1.inositol_retinol_metabolism.json",
    "glyco": f"{config.data_dir}/maps/RECON1/RECON1.glycolysis_TCA_PPP.json",
    "combined": f"{config.data_dir}/maps/RECON1/RECON1.combined.json",
    "carbo": f"{config.data_dir}/maps/RECON1/RECON1.carbohydrate_metabolism.json",
    "amino": f"{config.data_dir}/maps/RECON1/RECON1.amino_acid_partial_metabolism.json",
}

# Collect user-provided JSON maps
for file in user_map_dir.glob("**/*.json"):
    map_dict[file.stem] = file

# Collect any additional maps under the `{config.data_dir}/maps/` directory.
# Compare as Path objects so the default (string) entries above are recognized and not added twice
known_maps = {Path(path) for path in map_dict.values()}
for file in Path(f"{config.data_dir}/maps").glob("**/*.json"):
    if file not in known_maps:
        map_dict[file.stem] = file

for recon_algorithm in recon_algorithms:
    for context in context_gene_exp.keys():
        # for context in ["naiveB", "smB"]:
        print(f"Starting {context}")
        model_json = os.path.join(config.data_dir, "results", context, f"{context}_SpecificModel_{recon_algorithm}.json")

        print(f"Loading '{context}', this may take some time...")
        model = cobra.io.load_json_model(model_json)
        for key in map_dict.keys():
            print(f"Running with: {key}")
            builder = Builder(map_json=str(map_dict[key]))
            builder.model = model
            solution = cobra.flux_analysis.pfba(model)
            builder.reaction_data = solution.fluxes
            builder.reaction_scale = [
                {"type": "min", "color": "#ff3300", "size": 12},
                {"type": "q1", "color": "#ffc61a", "size": 14},
                {"type": "median", "color": "#ffe700", "size": 16},
                {"type": "q3", "color": "#4ffd3c", "size": 18},
                {"type": "max", "color": "#3399ff", "size": 20},
            ]
            builder.reaction_no_data_color = "#8e8e8e"

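            # Make sure the output directory exists before writing the map
            figures_dir = os.path.join(config.data_dir, "results", context, "figures")
            os.makedirs(figures_dir, exist_ok=True)
            builder.save_html(os.path.join(figures_dir, f"{key}_map_{context}_{recon_algorithm}.html"))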

        out_dir = os.path.join(config.data_dir, "results", context)
        # for algorithm in ["GIMME", "IMAT", "FASTCORE", "tINIT"]:
        report_file = os.path.join(out_dir, f"memote_report_{context}_{recon_algorithm}.html")
        model_file = os.path.join(out_dir, f"{context}_SpecificModel_{recon_algorithm}.xml")
        log_dir = os.path.join(out_dir, "memote")
        log_file = os.path.join(log_dir, f"{context}_{recon_algorithm}_memote.log")

        os.makedirs(log_dir, exist_ok=True)  # Create the log directory if it does not exist

        cmd = " ".join(["memote", "report", "snapshot", "--filename", f"{report_file}", f"{model_file}", ">", f"{log_file}"])

        !{cmd}

# Step 3: Disease-related Gene Identification
This step identifies disease-related genes by analyzing patient transcriptomics data.

In the `data/config_sheets` folder, create another folder called `disease`. Add an Excel file for each tissue/cell type called `disease_data_inputs_<TISSUE_NAME>`, where `<TISSUE_NAME>` is the name of the tissue you are interested in. Each sheet of this file should correspond to a separate disease to analyze using differential gene analysis. The file is formatted in the same fashion as described in the [final part of Step 1](#Importing-a-Pre-Generated-Counts-Matrix). The sheet names should be in the following format: `<DISEASE_NAME>_bulk`
- `<DISEASE_NAME>`: This is the name of the disease you are analyzing.

For example, if the disease we are interested in is lupus, and the source of the data is bulk RNA sequencing, the name of the first sheet would be `lupus_bulk`. If you are using bulk RNA sequencing, there should be a gene counts matrix file located at `data/data_matrices/<TISSUE_NAME>/<DISEASE_NAME>` called `BulkRNAseqDataMatrix<DISEASE_NAME>_<TISSUE_NAME>`.
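
The naming conventions above can be collected in a small helper for checking your own files. The function name `disease_inputs` is hypothetical, and file extensions are left off because this section does not state them for every file.

```python
from pathlib import PurePosixPath


def disease_inputs(tissue_name, disease_name):
    """Return the names this step expects for one tissue/disease pair (extensions omitted)."""
    return {
        "config_file": f"disease_data_inputs_{tissue_name}",
        "sheet_name": f"{disease_name}_bulk",
        "matrix_dir": PurePosixPath("data/data_matrices") / tissue_name / disease_name,
        "matrix_name": f"BulkRNAseqDataMatrix{disease_name}_{tissue_name}",
    }


for key, value in disease_inputs("naiveB", "lupus").items():
    print(f"{key}: {value}")
```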

## Parameters
- `disease_names`: The diseases you are using. Each name should match the first section of the sheet name in the Excel file
- `data_source`: The data source you are using for disease analysis. This should be `"rnaseq"`
- `taxon_id`: The [NCBI Taxon ID](https://www.ncbi.nlm.nih.gov/taxonomy) to use for disease analysis

In [None]:
disease_names = ["arthritis", "lupus_a", "lupus_b"]
data_source = "rnaseq"
taxon_id = "human"

from como.como_utilities import stringlist_to_list

for context_name in stringlist_to_list(context_names):
    disease_config_file = f"disease_data_inputs_{context_name}.xlsx"

    # fmt: off
    cmd = " ".join(
        [
            "python3", "como/disease_analysis.py",
            "--context-name", f"{context_name}",
            "--config-file", f"{disease_config_file}",
            "--data-source", f"{data_source}",
            "--taxon-id", f"{taxon_id}",
        ]
    )
    # fmt: on

    !{cmd}

# Step 4: Drug Targets & Repurposable Drug Identification
This step performs a series of tasks:
1. Maps drug targets in metabolic models
2. Performs knock-out simulations
3. Compares simulation results with "disease genes"
4. Identifies drug targets and repurposable drugs

## Execution Steps
### Drug Database
A processed drug-target file is included in the `data` folder, called `Repurposing_Hub_export.txt`. If you would like to include an additional drug-target file, please model your own file after the included one. Alternatively, if you would like to update to a newer version of the database, simply export it from the [Drug Repurposing Hub](https://clue.io/repurposing-app). If you do this, remove all `activators`, `agonists`, and `withdrawn` drugs, then replace the `data/Repurposing_Hub_export.txt` file with your export.
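
The filtering step can be sketched as follows. Everything about the file layout here is an assumption: the tab delimiter, the column names `name`, `moa`, and `clinical_phase`, and the sample rows are placeholders, so adapt them to the headers in your actual export. Whole-word matching is used so that "antagonist" is not mistaken for "agonist".

```python
import csv
import io
import re

# Placeholder data standing in for a drug-target export (not real entries)
sample_export = (
    "name\tmoa\tclinical_phase\n"
    "drugA\tkinase inhibitor\tLaunched\n"
    "drugB\treceptor agonist\tLaunched\n"
    "drugC\tenzyme activator\tPhase 2\n"
    "drugD\tkinase inhibitor\tWithdrawn\n"
    "drugE\treceptor antagonist\tLaunched\n"
)

EXCLUDED_MOA_WORDS = {"activator", "activators", "agonist", "agonists"}


def keep_drug(row):
    """Keep a drug unless it is withdrawn or acts as an activator/agonist."""
    moa_words = set(re.findall(r"[a-z]+", row["moa"].lower()))
    withdrawn = row["clinical_phase"].strip().lower() == "withdrawn"
    return not withdrawn and not (moa_words & EXCLUDED_MOA_WORDS)


rows = list(csv.DictReader(io.StringIO(sample_export), delimiter="\t"))
kept = [row["name"] for row in rows if keep_drug(row)]
print(kept)  # ['drugA', 'drugE']
```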

### Using Automatically Created Models
This step will use the models generated in Step 2, above. It is **highly** recommended to use refined and validated models for further analysis (i.e., before running this step of the pipeline). If you would like to use a custom model, instead of the one created by COMO, edit the `model_files` dictionary. An example is shown here:
```python
model_files = {
   "exampleTissueModel": "/home/jovyan/main/data/myModels/exampleTissueModel.mat",
   "anotherTissueModel": "/home/jovyan/main/data/myModels/anotherTissueModel.json",
   "thirdTissueModel": "/home/jovyan/main/data/myModels/thirdTissueModel.xml"
}
```

❗The path `/home/jovyan/main/` **<ins>MUST</ins>** stay the same. If it does not, your model **will not be found**
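
Because custom models may be stored as `.mat`, `.json`, or `.xml`, it can help to confirm that each one loads before running the knock-out simulation. The sketch below picks a COBRApy loader from the file extension; the helper names `loader_name_for` and `load_model` are hypothetical, and this is not how COMO itself loads models.

```python
from pathlib import Path

# COBRApy loader function for each supported file extension
LOADERS = {
    ".mat": "load_matlab_model",
    ".json": "load_json_model",
    ".xml": "read_sbml_model",
}


def loader_name_for(model_path):
    """Return the name of the cobra.io function that reads this file type."""
    suffix = Path(model_path).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"Unsupported model file type: {suffix!r}")
    return LOADERS[suffix]


def load_model(model_path):
    """Load a model with COBRApy (imported lazily so the helper above works without it)."""
    import cobra

    return getattr(cobra.io, loader_name_for(model_path))(str(model_path))


print(loader_name_for("/home/jovyan/main/data/myModels/exampleTissueModel.mat"))
# load_matlab_model
```

Calling `load_model(path)` on each value in `model_files` before the knock-out step surfaces a missing or unreadable file early.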


## Parameters
Besides the `model_files` parameter (if required), the only parameter for this section is the `solver` option.

- `solver`: The solver you would like to use. Available options are `"gurobi"` or `"glpk"`


In [None]:
# Knock out simulation for the analyzed tissues and diseases
model_files = {
    # "context_name": "/path/to/model.mat"
    # EXAMPLE -> "Treg": "/home/jovyan/main/data/results/naiveB/naiveB_SpecificModel_IMAT.mat"
}
solver = "gurobi"

import json
import os

from como.como_utilities import stringlist_to_list
from como.project import Config

config = Config()

drug_raw_file = "Repurposing_Hub_export.txt"
for context in stringlist_to_list(context_names):
    for recon_algorithm in recon_algorithms:
        for disease in disease_names:
            disease_path = os.path.join(config.data_dir, "results", context, disease)
            out_dir = os.path.join(config.data_dir, "results", context, disease)
            tissue_gene_folder = os.path.join(config.data_dir, context)
            os.makedirs(tissue_gene_folder, exist_ok=True)

            if not os.path.exists(disease_path):
                print(f"Disease path doesn't exist! Looking for {disease_path}")
                continue

            # load the results of step 3 to dictionary "disease_files"
            step3_results_file = os.path.join(config.data_dir, "results", context, disease, "step2_results_files.json")

            with open(step3_results_file) as json_file:
                disease_files = json.load(json_file)
                down_regulated_disease_genes = disease_files["down_regulated"]
                up_regulated_disease_genes = disease_files["up_regulated"]

            if context in model_files.keys():
                tissueSpecificModelfile = model_files[context]
            else:
                tissueSpecificModelfile = os.path.join(config.data_dir, "results", context, f"{context}_SpecificModel_{recon_algorithm}.mat")

            # fmt: off
            cmd = [
                "python3", "como/knock_out_simulation.py",
                "--context-model", f"{tissueSpecificModelfile}",
                "--context-name", f"{context}",
                "--disease-name", f"{disease}",
                "--disease-up", f"{up_regulated_disease_genes}",
                "--disease-down", f"{down_regulated_disease_genes}",
                "--raw-drug-file", f"{drug_raw_file}",
                "--solver", f"{solver}",
                # "--test-all"
            ]
            # fmt: on

            if recon_algorithm == "IMAT":
                cmd.extend(["--reference-flux-file", os.path.join(config.data_dir, "results", context, "IMAT_flux.csv")])

            cmd = " ".join(cmd)
            !{cmd}