# GEM-PRO - Genes & Sequences

This notebook gives an example of how to run the GEM-PRO pipeline with a **dictionary of gene IDs and their protein sequences**.

<div class="alert alert-info">

**Input:** 
Dictionary of gene IDs and protein sequences

</div>

<div class="alert alert-info">

**Output:**
GEM-PRO model

</div>

## Imports

In [None]:
import sys
import logging

In [None]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

In [None]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Logging

Set the logging level with `logger.setLevel(logging.<LEVEL_HERE>)` to control how verbose the pipeline is. The levels below are listed from least to most verbose.

- `CRITICAL`
     - Only really important messages shown
- `ERROR`
     - Major errors
- `WARNING`
     - Warnings that don't affect running of the pipeline
- `INFO` (default)
     - Info such as the number of structures mapped per gene
- `DEBUG`
     - Highly detailed information, which produces a large volume of output
     
<div class="alert alert-warning">

**Warning:** 
`DEBUG` mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
</div>

In [None]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

In [None]:
# Send timestamped, formatted log messages to stderr so they appear in the notebook
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
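
To see how the chosen level filters messages, here is a minimal sketch that uses only the standard library. The logger name `demo` and the helper `emitted_levels` are hypothetical and are not part of `ssbio`.

```python
import io
import logging


def emitted_levels(level):
    """Return the level names that get through a logger set to `level`."""
    stream = io.StringIO()
    demo_logger = logging.getLogger("demo")  # hypothetical logger name
    demo_logger.propagate = False  # keep the demo output away from the root logger
    demo_handler = logging.StreamHandler(stream)
    demo_handler.setFormatter(logging.Formatter("%(levelname)s"))
    demo_logger.handlers = [demo_handler]
    demo_logger.setLevel(level)

    # Emit one message at every level, from least to most severe
    for lvl in (logging.DEBUG, logging.INFO, logging.WARNING,
                logging.ERROR, logging.CRITICAL):
        demo_logger.log(lvl, "test message")

    return stream.getvalue().split()


print(emitted_levels(logging.INFO))   # DEBUG is filtered out
print(emitted_levels(logging.ERROR))  # only ERROR and CRITICAL remain
```

A message is shown only if its level is at or above the level set on the logger, which is why `DEBUG` produces the most output.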

## Initialization of the project

Set these three things:

- `ROOT_DIR`
    - The directory where a folder named after your `PROJECT` will be created
- `PROJECT`
    - Your project name
- `GENES_AND_SEQUENCES`
    - A dictionary mapping your gene IDs to their protein sequences
    
A directory will be created in `ROOT_DIR` with your `PROJECT` name. The folders are organized like so:
```
    ROOT_DIR
    └── PROJECT
        ├── data  # General storage for pipeline outputs
        ├── model  # SBML and GEM-PRO models are stored here
        ├── genes  # Per gene information
        │   ├── <gene_id1>  # Specific gene directory
        │   │   └── protein
        │   │       ├── sequences  # Protein sequence files, alignments, etc.
        │   │       └── structures  # Protein structure files, calculations, etc.
        │   └── <gene_id2>
        │       └── protein
        │           ├── sequences
        │           └── structures
        ├── reactions  # Per reaction information
        │   └── <reaction_id1>  # Specific reaction directory
        │       └── complex
        │           └── structures  # Protein complex files
        └── metabolites  # Per metabolite information
            └── <metabolite_id1>  # Specific metabolite directory
                └── chemical
                    └── structures  # Metabolite 2D and 3D structure files
                
```

<div class="alert alert-info">**Note:** Methods for protein complexes and metabolites are still in development.</div>
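
As a rough sketch of how the layout above translates into paths, the hypothetical helper `gene_dirs` below builds the expected per-gene folders with `os.path`. It assumes the folder names match the tree shown above; it does not create anything on disk, and the project object's own attributes should be preferred when available.

```python
import os.path as op
import tempfile


def gene_dirs(root_dir, project, gene_id):
    """Build the expected sequence and structure folders for one gene."""
    protein_dir = op.join(root_dir, project, "genes", gene_id, "protein")
    return {
        "sequences": op.join(protein_dir, "sequences"),
        "structures": op.join(protein_dir, "structures"),
    }


dirs = gene_dirs(tempfile.gettempdir(), "genes_and_sequences_GP", "b0870")
for name, path in dirs.items():
    print(name, "->", path)
```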

In [None]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'genes_and_sequences_GP'
GENES_AND_SEQUENCES = {'b0870': 'MIDLRSDTVTRPSRAMLEAMMAAPVGDDVYGDDPTVNALQDYAAELSGKEAAIFLPTGTQANLVALLSHCERGEEYIVGQAAHNYLFEAGGAAVLGSIQPQPIDAAADGTLPLDKVAMKIKPDDIHFARTKLLSLENTHNGKVLPREYLKEAWEFTRERNLALHVDGARIFNAVVAYGCELKEITQYCDSFTICLSKGLGTPVGSLLVGNRDYIKRAIRWRKMTGGGMRQSGILAAAGIYALKNNVARLQEDHDNAAWMAEQLREAGADVMRQDTNMLFVRVGEENAAALGEYMKARNVLINASPIVRLVTHLDVSREQLAEVAAHWRAFLAR',
                       'b3041': 'MNQTLLSSFGTPFERVENALAALREGRGVMVLDDEDRENEGDMIFPAETMTVEQMALTIRHGSGIVCLCITEDRRKQLDLPMMVENNTSAYGTGFTVTIEAAEGVTTGVSAADRITTVRAAIADGAKPSDLNRPGHVFPLRAQAGGVLTRGGHTEATIDLMTLAGFKPAGVLCELTNDDGTMARAPECIEFANKHNMALVTIEDLVAYRQAHERKAS'}
PDB_FILE_TYPE = 'mmtf'
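
Before creating the project, it can help to check that each sequence contains only expected characters. The sketch below is one way to do that; `find_invalid_residues` is a hypothetical helper, not an `ssbio` function, and it assumes only the 20 standard one-letter amino acid codes are wanted (so ambiguity codes such as `X` would also be flagged).

```python
# The 20 standard amino acids in one-letter code
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(genes_and_sequences):
    """Map each gene ID to the sorted list of unexpected characters in its sequence."""
    problems = {}
    for gene_id, seq in genes_and_sequences.items():
        unexpected = sorted(set(seq.upper()) - STANDARD_AMINO_ACIDS)
        if unexpected:
            problems[gene_id] = unexpected
    return problems


# Toy sequences made up for illustration, not real proteins
toy = {"gene_ok": "MIDLRSDTV", "gene_bad": "MNQ*TLL1"}
print(find_invalid_residues(toy))  # only gene_bad is reported
```

An empty dictionary means every sequence passed the check.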

In [None]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_and_sequences=GENES_AND_SEQUENCES, pdb_file_type=PDB_FILE_TYPE)

## Mapping sequence --> structure

Since the sequences have been provided, we just need to BLAST them to the PDB.

<p><div class="alert alert-info">**Note:** These methods do not download any 3D structure files.</div></p>

### Methods

In [None]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
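
The two thresholds in the call above can be pictured as a simple filter. The sketch below assumes `seq_ident_cutoff` is a minimum fraction of identical residues and `evalue` is a maximum E-value; the exact way `ssbio` computes and applies them may differ. The helper `filter_hits` and the hit records are hypothetical.

```python
def filter_hits(hits, seq_ident_cutoff=0.9, evalue=1e-5):
    """Keep hits at or above the identity cutoff and at or below the E-value."""
    return [
        hit for hit in hits
        if hit["identity"] >= seq_ident_cutoff and hit["evalue"] <= evalue
    ]


# Made-up hits for illustration only
hits = [
    {"id": "hit_A", "identity": 0.98, "evalue": 1e-50},
    {"id": "hit_B", "identity": 0.45, "evalue": 1e-20},  # identity too low
    {"id": "hit_C", "identity": 0.95, "evalue": 1e-2},   # E-value too high
]
print([hit["id"] for hit in filter_hits(hits)])
```

Raising the identity cutoff or lowering the E-value makes the mapping stricter and returns fewer structures per gene.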

## Downloading and ranking structures

### Methods

<div class="alert alert-warning">

**Warning:** 
Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.

</div>

In [None]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)

In [None]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

In [None]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b0870').protein.representative_structure
my_gempro.genes.get_by_id('b0870').protein.representative_structure.get_dict()

## Creating homology models

For proteins with no representative structure, we can create homology models. `ssbio` contains built-in functions for running [I-TASSER](http://zhanglab.ccmb.med.umich.edu/I-TASSER/download/) locally or on machines with `SLURM` (e.g., on NERSC) or `Torque` job scheduling.

Once the I-TASSER runs complete, you can load the resulting models later with `get_itasser_models`.

<p><div class="alert alert-info">**Info:** Homology modeling can take a long time, about 24-72 hours per protein, depending heavily on the sequence length and on whether templates are available.</div></p>

### Methods

In [None]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_modeling('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)
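
The paths above start with `~`, and whether they are expanded for you is not shown here. As a cautious sketch, the hypothetical helper `resolve_dir` below expands `~` and confirms that a directory exists before it is handed to the pipeline. The demo uses the system temporary directory because the I-TASSER paths are specific to your machine.

```python
import os.path as op
import tempfile


def resolve_dir(path):
    """Expand `~` and return the absolute path, or raise if it is not a directory."""
    full_path = op.abspath(op.expanduser(path))
    if not op.isdir(full_path):
        raise FileNotFoundError("Directory not found: {}".format(full_path))
    return full_path


# Demo with a directory that is expected to exist on any machine
print(resolve_dir(tempfile.gettempdir()))
```

Calling `resolve_dir` on the I-TASSER and library folders first turns a missing installation into an immediate, readable error.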

## Saving your GEM-PRO

<p><div class="alert alert-warning">**Warning:** Saving is still experimental. For a full GEM-PRO with sequences and structures, saving can take more than 5 minutes, depending on the number of genes.</div></p>

In [None]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
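
To sanity-check a file saved with `compression=False`, one option is to read it back with the standard `json` module. The sketch below assumes the file is plain JSON with an object at the top level; `summarize_json` is a hypothetical helper, and the demo writes a small stand-in file rather than a real GEM-PRO.

```python
import json
import os.path as op
import tempfile


def summarize_json(path):
    """Return the file size in bytes and the sorted top-level keys of a JSON object."""
    with open(path) as f:
        data = json.load(f)
    return op.getsize(path), sorted(data)


# Stand-in file for illustration; replace demo_path with your saved model path
demo_path = op.join(tempfile.mkdtemp(), "demo.json")
with open(demo_path, "w") as f:
    json.dump({"id": "demo", "genes": []}, f)

size, keys = summarize_json(demo_path)
print(size, keys)
```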